Engineering

6 Techniques to Cut AI Agent Memory Cost Beyond Basic Retrieval

TL;DR:

  • Retrieval-based memory cuts prompt tokens significantly over naive file injection on small stores. These six techniques stack on top of that baseline.

  • The biggest API-verified prompt-token wins are token budgeting (−75%), hierarchical summarization (−59%), and Ebbinghaus eviction (−59%), all measured against the same 24-entry store with real API token counts.

  • Techniques 4–6 do not reduce prompt tokens. They target vector index size, retrieval precision, and RAM footprint, which matter just as much at scale.

  • All six are implemented in the GitHub repo.

————

Switching from naive file-based memory injection to retrieval-based memory can meaningfully reduce prompt tokens on small stores, though savings vary based on store size and query patterns. In the companion comparator, Mem0-backed retrieval cut a 24-entry Hermes memory prompt from 594 tokens to 166 tokens on the same query. The full breakdown of why that works and how to set it up is covered in the companion post.

Retrieval-based memory is not the end of the optimization story. As your agent runs longer, a different set of failure modes starts appearing:

  • Stale entries inflating prompt costs as the store grows

  • Duplicate memories degrading retrieval precision over time

  • Vector indexes consuming RAM on constrained hardware

  • Retrieval latency accumulating across thousands of daily calls

These six techniques are independent layers you can stack on top of your existing retrieval architecture. Each targets a specific failure mode, and each is measured with the right metric for what it actually optimizes:

  • Prompt-shaping techniques (1–3) are verified with API token counts against the same 24-entry store

  • Storage and infrastructure techniques (4–6) use storage math, similarity scans, and cache proxies

Naive Injection vs Retrieval-Based Memory

Before getting into the advanced techniques, here is the retrieval baseline measured against a 24-entry Hermes memory store, same query and same model:

Method                               Prompt Tokens   Savings
Naive Hermes (full file dump)        594             —
Hermes + Mem0 retrieval (top-5)      166             −72%
Hermes + Mem0 retrieval (top-10)     293             −51%

This is the floor on a small 24-entry store. Savings will vary at larger store sizes. The six techniques below push costs lower or address failure modes that token count alone does not capture.

Quick Summary

Techniques 1–3 are verified with real API token counts using openai/gpt-4o-mini on the same 24-entry store with the same query. Techniques 4–6 use storage math, similarity scans, and cache proxies as noted:

Technique                                  Prompt Tokens   Savings vs Baseline (600)   Metric type
Token budgeting (budget=80, selected=3)    149             75.17%                      Real API token count
Hierarchical summarization                 247             58.83%                      Real API token count
Ebbinghaus eviction (9/24 retained)        249             58.50%                      Real API token count
Embedding quantization                     n/a             4× smaller index            Storage math
Self-curation                              n/a             1 merge candidate           Similarity scan
Hot/cold caching                           n/a             83.3% RAM proxy             Simulated cache

These percentages should not be added together. Each technique was measured as an alternative memory-shaping strategy against the same naive baseline, not as a cumulative pipeline.

How to Reproduce These Numbers?

The token counts for Techniques 1–3 come from verify_advanced_api.py, a dedicated verification script in the repo that calls OpenRouter (you can use OpenAI or any other API provider) with each prompt variant and reads back real usage.prompt_tokens.

To run it against your own memory directory:

python verify_advanced_api.py \
  --memory-dir examples/hermes-memory \
  --user-id hermes-advanced-api-1 \
  --memory-limit 10 \
  --advanced-budget-tokens 80

The script seeds Mem0 with your Hermes entries, retrieves memories for the query, builds four prompt variants (naive baseline, token budgeting, hierarchical summary, Ebbinghaus filtered), sends each to the same model, and prints the token counts side by side. The numbers in this article came from running it on a 24-entry store with openai/gpt-4o-mini.

⭐️ Check out this repo for the full code

Let’s walk through these techniques one by one:

  1. Token Budgeting (−75% Prompt Tokens)

Token budgeting is the highest-leverage technique in this list and the simplest to implement. You set a hard token ceiling for the memory context, fill it from your retrieved results in priority order, stop when the budget is hit, and append a short note telling the model how many memories were omitted. The model knows it is working with a constrained view rather than silently receiving incomplete context.

It decouples prompt token cost from both query complexity and memory store size. Without a budget, a broad query that retrieves 10 semantically relevant entries injects all 10 regardless of whether entries 6 through 10 actually add anything. With a budget, you control the ceiling exactly. Peak memory token cost becomes predictable regardless of store size or query type.

Measured result: With budget=80, the script selected 3 memories and omitted 7 as overflow, producing a 149-token prompt, a 75% reduction from the 600-token naive baseline.

Snippets below are abbreviated from the comparator. See the repo for imports and full runnable context.

def budget_memories(
    memories: list[str], token_budget: int, model_hint: str
) -> BudgetedContext:
    selected: list[str] = []
    used = 0
    overflow = 0

    for memory in memories:
        tokens = estimate_tokens(memory, model_hint)
        if used + tokens <= token_budget:
            selected.append(memory)
            used += tokens
        else:
            overflow += 1

    context = "\\n".join(f"- {m}" for m in selected)
    if overflow:
        context += (
            f"\\n\\n[Note: {overflow} additional relevant memories were omitted "
            "to stay within the memory token budget.]"
        )
    if not context:
        context = "- No relevant Mem0 memories found."

    return BudgetedContext(
        memories=selected, context=context,
        tokens_used=used, overflow_count=overflow,
    )

The overflow note is not cosmetic. Without it, the model has no signal that its memory context is incomplete, which can produce overconfident responses on queries where the omitted memories were actually relevant. Telling the model explicitly keeps its outputs calibrated.

Where to set the budget: Start at 500–800 tokens for typical personal assistant or in-home agents. For multi-domain agents handling varied query types, instrument your p95 memory token usage over a week and set the budget at 120 percent of that value. This gives you a ceiling that handles outlier queries without over-constraining normal ones.
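
A minimal sketch of that sizing rule, assuming you log per-request memory token counts (the function name and log source are illustrative):

import statistics

def budget_from_usage(memory_token_counts: list[int]) -> int:
    """Derive a memory token budget from observed usage.

    Assumes `memory_token_counts` holds a week of per-request
    memory-context token counts pulled from your own logs.
    """
    p95 = statistics.quantiles(memory_token_counts, n=20)[18]  # ~95th percentile
    return int(p95 * 1.2)  # 120% of p95, per the guidance above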

  2. Hierarchical Summarization (−59% Prompt Tokens)

Instead of injecting raw memory entries, build a two-level summary: per-source session summaries that compress individual entries, rolled up into a single long-horizon summary that captures durable behavioral patterns. Each level is more compressed than the one below.

Raw memory entries carry a lot of incidental phrasing from the original conversation, including stop words, hedging language, and repeated context. Summarization extracts the signal and discards the noise. The long-horizon layer additionally captures patterns that span sessions and would otherwise require multiple entries to reconstruct at retrieval time.

Measured result (real API token count): Prompt tokens reduced from 600 to 247 (−59%).

def build_hierarchical_summary(entries: list[tuple[str, dict[str, str]]]) -> str:
    by_source: dict[str, list[str]] = {}
    for text, metadata in entries:
        by_source.setdefault(metadata["source"], []).append(text)

    session_summaries = []
    for source, texts in by_source.items():
        # Rank by importance score, keep top 4 per source
        important = sorted(texts, key=entry_importance, reverse=True)[:4]
        session_summaries.append(f"{source}: " + " ".join(important))

    long_horizon = (
        "Long-horizon summary: user values concise, reversible home-automation "
        "recommendations; nighttime lighting should be dim; safety-critical devices "
        "should not be changed without explicit approval."
    )
    return "\\n".join([*session_summaries, long_horizon])

The entry_importance function determines which entries make it into the session summary. It scores entries by keyword presence, where safety and security terms score highest (0.95) and deployment and demo terms score lowest (0.45), so the summary prioritizes durable behavioral preferences over transient operational notes.

The tradeoff: Summarization loses granularity. A user asking about a specific past decision may get a more general response from a summary than from a raw entry that contained the exact detail. Hierarchical summarization works best for preference and behavioral memory, less well for episodic recall of specific events. If your agent needs both, run summarization for preference-type entries and retrieval for episodic ones, then compose the two contexts before the inference call.
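
If your agent needs both modes, the composition step might look like the sketch below. It reuses build_hierarchical_summary from above; the per-entry "kind" tag and the term-overlap ranking are assumptions standing in for however your pipeline classifies entries and retrieves episodic ones:

import re

def compose_memory_context(
    entries: list[tuple[str, dict[str, str]]], query: str, top_k: int = 3
) -> str:
    # Split on a per-entry "kind" tag (an assumption; your extraction
    # pipeline decides how entries get classified).
    preference = [(t, m) for t, m in entries if m.get("kind") == "preference"]
    episodic = [(t, m) for t, m in entries if m.get("kind") == "episodic"]

    # Durable preferences go through the hierarchical summarizer above.
    summary = build_hierarchical_summary(preference) if preference else ""

    # Episodic entries stay raw so specific details survive; rank them by
    # term overlap with the query as a stand-in for vector retrieval.
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    ranked = sorted(
        episodic,
        key=lambda e: len(query_terms & set(re.findall(r"[a-z0-9]+", e[0].lower()))),
        reverse=True,
    )
    raw = "\n".join(f"- {text}" for text, _ in ranked[:top_k])

    # Compose both contexts ahead of the inference call.
    return "\n\n".join(part for part in (summary, raw) if part)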

  3. Importance-Based Eviction with Ebbinghaus Decay (−59% Prompt Tokens)

Ebbinghaus decay models forgetting directly: assign each entry an importance score based on its content, apply an exponential decay function based on age and access frequency, and move entries that fall below a threshold out of the active prompt set.

Stale entries are the most common source of silent context bloat in long-running agents. An agent running for three months has accumulated entries that were relevant in week one and have not been accessed since. Without active eviction, those entries accumulate in every prompt indefinitely. Decay-based eviction can move them out of the active set automatically as their relevance expires. In production, the recommended approach is to demote these entries to cold storage first rather than deleting them, so they can be recovered if the threshold was set too aggressively.

Measured result (real API token count): 9 of 24 entries retained and prompt tokens reduced from 600 to 249 (−59%). In the fixture, the entries filtered out were mostly deployment notes, demo configuration, and lower-priority context. In production, which entries get moved out depends on your real access metadata, such as age and access counts, which the script simulates based on entry position and keyword presence.

def entry_importance(text: str) -> float:
    lowered = text.lower()
    if any(w in lowered for w in ("safety", "lock", "alarm", "security", "critical")):
        return 0.95
    if any(w in lowered for w in ("prefers", "should", "do not", "unless", "wants")):
        return 0.75
    if any(w in lowered for w in ("deployment", "project", "demo")):
        return 0.45
    return 0.55

def ebbinghaus_score(
    importance: float, created_at: datetime,
    access_count: int, now: datetime, stability: float = 1.0,
) -> float:
    days_elapsed = max(0, (now - created_at).days)
    decay = math.exp(-days_elapsed / (stability * 10))
    return importance * decay * (1 + 0.1 * access_count)

def evict_with_decay(
    entries: list[tuple[str, dict[str, str]]],
    threshold: float, now: datetime,
) -> list[tuple[str, dict[str, str]]]:
    retained = []
    for index, (text, metadata) in enumerate(entries):
        age_days = 3 + (index % 8) * 6
        access_count = (
            4 if any(t in text.lower() for t in ("porch", "outside", "lights"))
            else index % 3
        )
        score = ebbinghaus_score(
            importance=entry_importance(text),
            created_at=now - timedelta(days=age_days),
            access_count=access_count,
            now=now,
        )
        if score > threshold:
            retained.append((text, metadata))
    return retained

Threshold tuning: The script defaults to --eviction-threshold 0.15. Start there and monitor your agent's behavior for two to three weeks before adjusting. Setting it too high evicts memories that still matter for safety preferences and long-standing behavioral patterns in particular. A useful production check: after each eviction pass, query your agent about its most-accessed preferences and verify the answers are still accurate. If they degrade, lower the threshold or increase the stability parameter for high-importance entry categories.

In production, age_days and access_count should come from your memory store's metadata rather than being simulated. If your store does not currently track access timestamps and access counts, adding that instrumentation is the prerequisite before enabling eviction.
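
A sketch of that production variant, assuming each entry's metadata carries a timezone-aware ISO-8601 created_at and an integer access_count (field names are illustrative; match them to your store's schema):

from datetime import datetime, timezone

def evict_with_real_metadata(
    entries: list[tuple[str, dict[str, str]]], threshold: float = 0.15
) -> list[tuple[str, dict[str, str]]]:
    now = datetime.now(timezone.utc)
    retained = []
    for text, metadata in entries:
        # Real metadata replaces the simulated age/access values.
        created_at = datetime.fromisoformat(metadata["created_at"])
        access_count = int(metadata.get("access_count", 0))
        score = ebbinghaus_score(
            importance=entry_importance(text),
            created_at=created_at,
            access_count=access_count,
            now=now,
        )
        if score > threshold:
            retained.append((text, metadata))
    return retained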

  4. Embedding Quantization (4× Storage Reduction)

This technique does not reduce prompt tokens. It reduces the RAM and disk footprint of your vector index, which is the primary bottleneck for agents running on constrained hardware like Raspberry Pi 4/5 or self-hosted homelab setups.

Standard float32 embeddings use 4 bytes per dimension. A 512-dimension embedding for a single memory entry costs 2,048 bytes. Across 10,000 entries that is 20 MB of float32 vectors just for the index. Int8 quantization maps each float value to a signed byte, cutting storage to 512 bytes per entry, so the same 10,000-entry store fits in 5 MB instead of 20 MB, a 4× reduction.

Measured result: We found that float32 requires 49,152 bytes for 24 entries, while int8 requires just 12,288 bytes, which is 4.0× smaller. The synthetic quantization check showed reconstruction error near zero. In production, validate recall with your actual embedding model and query set before assuming retrieval ranking is unchanged. Int8 quantization is often a strong default for memory-style retrieval, but it should be confirmed on your specific workload before deploying.

def quantization_metrics(
    entries: list[tuple[str, dict[str, str]]], dimensions: int = 512
) -> dict[str, float]:
    float_bytes = len(entries) * dimensions * 4
    int8_bytes = len(entries) * dimensions
    cosine_errors = []

    for text, _ in entries[:min(8, len(entries))]:
        vector = deterministic_embedding(text, dimensions)
        max_abs = max(abs(v) for v in vector) or 1.0
        scale = 127.0 / max_abs
        quantized = [round(v * scale) for v in vector]
        restored = [v / scale for v in quantized]
        dot = sum(a * b for a, b in zip(vector, restored))
        norm_a = math.sqrt(sum(a * a for a in vector))
        norm_b = math.sqrt(sum(b * b for b in restored))
        cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
        cosine_errors.append(1 - cosine)

    avg_error = sum(cosine_errors) / len(cosine_errors) if cosine_errors else 0.0
    return {
        "float_bytes": float_bytes,
        "int8_bytes": int8_bytes,
        "reduction": float_bytes / int8_bytes if int8_bytes else 0.0,
        "avg_cosine_error": avg_error,
    }

Where this matters most: A Raspberry Pi 4 with 4 GB RAM running a local agent alongside a local LLM has very little headroom for a vector index. Cutting index RAM by 75 percent can be the difference between a stable deployment and one that swaps constantly. For cloud deployments with abundant RAM it is primarily a cost optimization, but still worth doing.
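
To make the RAM saving concrete, here is a sketch of holding the index as int8 codes plus one float32 scale per vector, mirroring the symmetric scheme in quantization_metrics. It assumes NumPy and nonzero float32 vectors; the function names are illustrative, not part of the repo:

import numpy as np

def quantize_index(vectors: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # vectors: (n, d) float32. One scale per vector, symmetric int8 mapping.
    max_abs = np.abs(vectors).max(axis=1, keepdims=True)
    max_abs[max_abs == 0] = 1.0
    scales = (max_abs / 127.0).astype(np.float32)
    codes = np.round(vectors / scales).astype(np.int8)
    return codes, scales  # codes: 1 byte/dim instead of 4

def search(codes: np.ndarray, scales: np.ndarray, query: np.ndarray, k: int = 5):
    # Dequantize on the fly; only the int8 codes live in RAM permanently.
    restored = codes.astype(np.float32) * scales
    restored /= np.linalg.norm(restored, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return np.argsort(restored @ query)[::-1][:k]  # top-k by cosine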

  5. Self-Curation with Jaccard Similarity (Retrieval Precision)

Duplicate and near-duplicate entries degrade retrieval in two ways. First, they split the relevance signal across multiple entries, so top-K results contain redundant information instead of diverse coverage. Second, they inflate store size, which pushes down the recall rate of genuinely distinct entries over time.

Self-curation scans the store periodically for near-duplicate pairs using Jaccard token-overlap similarity and flags them as merge candidates. The script surfaces candidates for review rather than automatically deleting or merging anything, which is the right default for a production system where false positives have real consequences.

Measured result: 1 merge candidate found with similarity ≥ 0.30, score = 0.44, between:

  • "Outside lights include porch, driveway, patio, and garden path lights..."

  • "Often asks for 'outside lights' when referring to porch, driveway, patio..."

These two entries encode the same semantic fact from slightly different angles. Merging them into one canonical entry improves retrieval precision and store cleanliness.

def jaccard_similarity(left: str, right: str) -> float:
    stopwords = {
        "a", "an", "and", "are", "as", "at", "be", "before", "for", "from",
        "in", "is", "it", "of", "on", "or", "should", "the", "to", "when", "with",
    }
    left_terms = set(re.findall(r"[a-z0-9]+", left.lower())) - stopwords
    right_terms = set(re.findall(r"[a-z0-9]+", right.lower())) - stopwords
    if not left_terms or not right_terms:
        return 0.0
    return len(left_terms & right_terms) / len(left_terms | right_terms)

def self_curation_candidates(
    entries: list[tuple[str, dict[str, str]]]
) -> list[tuple[str, str, float]]:
    candidates = []
    for i, (left, _) in enumerate(entries):
        for right, _ in entries[i + 1:]:
            score = jaccard_similarity(left, right)
            if score >= 0.30:
                candidates.append((left, right, score))
    return sorted(candidates, key=lambda item: item[2], reverse=True)

Running this in production: Jaccard similarity is O(n²) over entry count, which is fine up to a few thousand entries but becomes expensive at 50,000+. For large stores, run it as a weekly offline job on the most recently added entries rather than the full store. The freshest entries are where duplicates are most likely to accumulate, since memory extraction pipelines tend to write similar facts from similar queries around the same entities.
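
A sketch of that bounded weekly pass, reusing jaccard_similarity from above. It assumes entries arrive ordered oldest-to-newest; the recent_count default is illustrative:

def incremental_curation(
    entries: list[tuple[str, dict[str, str]]],
    recent_count: int = 200,
    threshold: float = 0.30,
) -> list[tuple[str, str, float]]:
    # Compare only the newest entries against the full store:
    # O(recent * total) instead of O(total^2).
    recent = entries[-recent_count:]
    older = entries[:-recent_count]
    candidates = []
    for i, (left, _) in enumerate(recent):
        # New-vs-new pairs, each counted once.
        for right, _ in recent[i + 1:]:
            score = jaccard_similarity(left, right)
            if score >= threshold:
                candidates.append((left, right, score))
        # New-vs-old pairs.
        for right, _ in older:
            score = jaccard_similarity(left, right)
            if score >= threshold:
                candidates.append((left, right, score))
    return sorted(candidates, key=lambda item: item[2], reverse=True)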

  6. Hybrid Hot/Cold Caching (83% RAM Reduction)

Like quantization, this technique does not reduce prompt tokens. It reduces retrieval latency and active RAM footprint by keeping only the most query-relevant entries in a fast in-memory hot layer while the rest sits in slower cold storage.

Entries with the highest term overlap with recent queries stay hot. The rest are demoted and promoted back when they accumulate enough recent accesses. In production this sits in front of your vector store or memory API, using real access statistics from your agent's query logs. The script uses a simulated access pattern, so treat the hit rate as a reproducible proxy rather than a production benchmark.

Measured result: The results show 4/24 hot entries with simulated hit rate of 40.0%, and RAM reduction proxy of 83.3%.

def hot_cold_cache_metrics(
    entries: list[tuple[str, dict[str, str]]], query: str
) -> dict[str, float]:
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    scored = []
    for text, _ in entries:
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        score = len(query_terms & terms)
        scored.append((score, text))

    # Keep top 20% in hot layer
    hot = {text for score, text in sorted(scored, reverse=True)[:max(1, len(entries) // 5)]}
    access_pattern = [text for _, text in sorted(scored, reverse=True)[:10]]
    access_pattern += [text for _, text in scored[:5]]
    hits = sum(1 for text in access_pattern if text in hot)
    hit_rate = hits / len(access_pattern) if access_pattern else 0.0
    ram_reduction = 1 - (len(hot) / len(entries)) if entries else 0.0

    return {
        "hot_count": len(hot),
        "total_count": len(entries),
        "hit_rate": hit_rate,
        "ram_reduction": ram_reduction,
    }

Implementation options: Redis with a TTL-based eviction policy works well for the hot layer in production. Entries not accessed within 48 hours are automatically demoted. For fully local deployments without Redis, a Python dict with an LRU eviction policy handles most use cases up to a few thousand hot entries.
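
A minimal sketch of that dict-based hot layer; the fetch_cold callable stands in for whatever loads an entry from your cold store (disk, SQLite, or the vector DB itself):

from collections import OrderedDict
from typing import Callable

class HotLayer:
    def __init__(self, fetch_cold: Callable[[str], str], capacity: int = 2000):
        self.fetch_cold = fetch_cold
        self.capacity = capacity
        self.cache: OrderedDict[str, str] = OrderedDict()

    def get(self, entry_id: str) -> str:
        if entry_id in self.cache:
            self.cache.move_to_end(entry_id)  # mark as recently used
            return self.cache[entry_id]
        text = self.fetch_cold(entry_id)      # cold-path fetch (promotion)
        self.cache[entry_id] = text
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return text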

How to Stack These Based on the Failure Mode

Not all six techniques belong in every deployment. Here is how to decide which ones to add first:

  • If prompt token cost is climbing week over week, start with token budgeting (Technique 1). It is around 15 lines of code, requires no changes to your memory store, and immediately caps your worst-case memory token spend. Add hierarchical summarization (Technique 2) next if your entries are verbose. Add Ebbinghaus eviction (Technique 3) after two to three weeks of operation when stale entries start accumulating.

  • If you are running on constrained hardware like a Raspberry Pi or homelab with limited RAM, prioritize embedding quantization (Technique 4) and hot/cold caching (Technique 6) first. These do not touch your prompt at all but can be the difference between a stable deployment and one that thrashes under memory pressure.

  • If retrieval precision is degrading as your store grows, schedule a weekly self-curation pass (Technique 5). This is especially important for agents that write their own memories, since extraction pipelines tend to produce near-duplicate entries around frequently queried entities.

Run --run-advanced against your own memory store to get measured numbers for all six techniques on your actual entries before committing to any of them:

python hermes_token_comparator.py \
  --memory-dir ~/.hermes/memories \
  --run-advanced \
  --eviction-threshold 0.15 \
  --advanced-budget-tokens 500

What to Do Next

Clone the repo, point it at your own --memory-dir, and run the above command. The numbers you get back are specific to your memory store, your entry distribution, and your query patterns, so they are more actionable than any benchmark from a different agent.

If prompt token cost is your immediate problem, start with token budgeting and have a ceiling in place in under an hour. If your agent has been running for months and retrieval feels less accurate than it used to, run the self-curation scan first and see how many duplicate entries have accumulated. If you are on constrained hardware, check the quantization output before doing anything else.

The right technique depends on what is actually failing.

⭐️ Clone the repo and run --run-advanced

Frequently Asked Questions

Q. Do these techniques work with agents other than Hermes?

Yes. The underlying patterns (token budgeting, summarization, eviction, quantization, deduplication, caching) apply to any agent framework with a persistent memory store. LangGraph, OpenClaw, and LangChain agents all exhibit the same failure modes as the store grows.

Q. Which technique should I add first?

Token budgeting. It has the highest measured impact, the lowest implementation effort, and no risk of accidentally removing important memories.

Q. Is the 75% token reduction from budgeting realistic at larger store sizes?

That result uses a budget of 80 tokens, which is aggressive. At 500–800 tokens you can expect 50–65% reduction with better coverage for multi-part queries. The cost ceiling stays fixed regardless of store size, which matters more than the exact percentage.

Q. How do I add access tracking for Ebbinghaus eviction in production?

Log a last_accessed_at timestamp and an access_count per entry in your memory store's metadata. The eviction function reads those values instead of simulating them from entry position.
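
A minimal sketch of that instrumentation, to be called on every retrieval hit; the field names are illustrative and should match your store's metadata schema:

from datetime import datetime, timezone

def record_access(metadata: dict[str, str]) -> dict[str, str]:
    # Update per-entry access metadata, then persist it back to the store.
    metadata["last_accessed_at"] = datetime.now(timezone.utc).isoformat()
    metadata["access_count"] = str(int(metadata.get("access_count", "0")) + 1)
    return metadata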

Q. Does int8 quantization work with all embedding models?

Many do, but validate on your actual model before deploying. Models with embedding dimensions under 128 are most likely to show noticeable cosine error after quantization.
