Engineering

AI Memory Management for LLMs and Agents

Every LLM interaction starts the same way: blank slate, no context, no history, no awareness of who the user is or what they have said before. That is the default. And for simple, one-off queries, it is fine.

But production AI systems are rarely one-off. Customer support agents handle ongoing cases that span days. Coding assistants work across repositories and sessions. Research tools accumulate findings over weeks. Personal AI assistants need to hold preferences, past conversations, and accumulated knowledge to be useful at all.

The gap between a stateless LLM and an agent that can actually remember is not a model capability question. It is a memory management question. This article explains how memory management works in practice - what the data structures look like, how extraction pipelines work, where things break down, and what the benchmark evidence says about different approaches.

Why This Is Harder Than It Looks

The naive approach to memory is intuitive: store conversation history, replay it on each request. If the user told the agent something last Tuesday, that turn is in the history, the model sees it, problem solved.

This works until it doesn't.

At 10 conversation turns, full-history replay is manageable. At 40 turns - common in any ongoing customer relationship or extended project - the input token count has ballooned well past what makes sense to send on every call. And even when the tokens are available, the model does not give every position in the context window equal attention. Stanford NLP's "Lost in the Middle" research documented this precisely: retrieval accuracy degrades significantly when relevant information lands in the middle of a long context, away from the edges where model attention is strongest.

There is also the session boundary problem. No matter how much history you accumulate in a single conversation, the context window resets when the session ends. Every new session starts from zero unless you have built something to persist and retrieve the relevant parts. A context window is working memory. What agents actually need is long-term memory - a different system entirely.

Mem0's published benchmark research (arXiv:2504.19413, accepted at ECAI) measured both failure modes precisely. Against the LOCOMO benchmark - a rigorous evaluation designed specifically for long-term conversational memory - the full-context approach (sending everything, every time) scored 72.9% accuracy at a median latency of 9.87 seconds and a p95 of 17.12 seconds. Mem0's selective memory retrieval scored 66.9% accuracy at 0.71 seconds median, 1.44 seconds p95. That p95 gap - 17 seconds versus 1.4 seconds - is the difference between a usable product and an abandoned one.

The headline number is this: full-context approaches consume roughly 26,000 tokens per conversation. Memory-based retrieval consumes roughly 1,800. That 93% reduction is not a configuration choice. It is the direct result of how the extraction pipeline works.

The Four Memory Layers

Effective memory management for agents is not a single store. It is a hierarchy of stores with different retention windows, retrieval patterns, and purposes.

Conversation memory is the active context window - what the model can see right now. Everything in a live turn is here: system prompt, current tool outputs, the messages in the current exchange. It is fast and complete, but it resets at the end of the session and its capacity is constrained by the model's context limit. This is where most teams stop.

Session memory spans a single task or goal sequence. It holds the thread of what is happening in a work session - the file the user is editing, the current debugging hypothesis, the draft under review. Session memory survives across the individual turns within a session but does not need to persist indefinitely. Not everything that matters now will matter in six months.

User memory is the long-term layer. It holds what is durably true about a person: their preferences, the tools they use, the projects they are working on, decisions they have made, communication style, domain context. User memory is the layer that makes an agent feel like it actually knows someone rather than starting fresh each time. It is also the most valuable and the most expensive to maintain correctly, because it has to stay accurate as things change.

Organizational memory operates at the team or company level. Shared policies, knowledge base entries, team-wide context that should be consistent across every agent and every user in a deployment. This is the layer that matters for enterprise applications where an agent needs to reflect company-specific knowledge regardless of which employee is using it.

Each layer answers a different question. Conversation memory answers "what is happening right now?" Session memory answers "what is the context for this task?" User memory answers "what do I know about this person?" Org memory answers "what is universally true for everyone in this organization?" A functional memory management system needs all four. Most implementations have only the first.
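The separation can be made concrete as data. A minimal sketch in Python - the class, field names, and retention labels are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryLayer:
    name: str
    question: str              # what the layer answers
    retention: Optional[str]   # None = retained indefinitely

# Illustrative registry of the four layers described above.
LAYERS = [
    MemoryLayer("conversation", "what is happening right now?", "end of session"),
    MemoryLayer("session", "what is the context for this task?", "end of task"),
    MemoryLayer("user", "what do I know about this person?", None),
    MemoryLayer("org", "what is universally true for this organization?", None),
]

def layer_for(scope: str) -> MemoryLayer:
    """Look up the layer a read or write should target."""
    for layer in LAYERS:
        if layer.name == scope:
            return layer
    raise ValueError(f"unknown scope: {scope}")
```

The point of keeping the layers as distinct stores, rather than one table with a type column, is that each one gets its own retention and retrieval policy.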

The short-term vs. long-term memory breakdown covers how these layers map to each other and where the architectural handoffs are.

How Memory Extraction Actually Works

Storing memories is easy. Storing the right memories - and keeping them accurate as they age - is the engineering problem.

The approach that produces the benchmark numbers above runs on a two-phase pipeline. The first phase handles extraction: taking new conversational input and deciding what, if anything, should change in the memory store. The second phase handles update: making those changes in a way that keeps the store accurate and non-redundant.

Phase 1: Extraction

Not every message contains memory-worthy information. "What time is it in Tokyo?" does not tell you anything persistent about the user. "I need this to work in a HIPAA-compliant environment and we're on AWS" tells you a lot and should survive long after the conversation ends.

The extraction phase runs an LLM pass over new conversational content with a specific objective: identify discrete, durable facts. Not summaries. Not compressed versions of the conversation. Specific extractable facts - "user is building a compliance tool," "user prefers Python," "user's timezone is CET" - that can be stored, searched, and retrieved independently.

This mirrors what cognitive science calls deep processing. Craik and Lockhart's 1972 levels-of-processing research established that the depth of encoding at storage time determines retrieval quality later. Storing raw text is shallow - you are preserving surface form without extracting meaning. Storing discrete facts is deep - you are transforming content into semantic units that retrieval can find precisely. The cognitive science breakdown covers these connections in detail.
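A minimal sketch of that extraction pass, assuming a generic chat-completion client - `llm.complete` is a placeholder interface, not a specific SDK call:

```python
import json

EXTRACTION_PROMPT = """Identify discrete, durable facts about the user in the
message below. Return a JSON list of short strings. Return [] if the message
contains nothing worth persisting (e.g. one-off questions).

Message: {message}"""

def extract_facts(llm, message: str) -> list[str]:
    """One LLM pass: conversational input -> candidate facts (possibly none)."""
    raw = llm.complete(EXTRACTION_PROMPT.format(message=message))
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError:
        return []  # treat malformed model output as "nothing to store"
    return [f for f in facts if isinstance(f, str) and f.strip()]
```

The important property is the empty-list escape hatch: most messages should produce no writes at all.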

Phase 2: Update

Extracted facts do not simply get appended to the store. Before any write happens, each candidate is compared against what already exists. Four operations are possible:

ADD - The fact is new. Nothing in the existing store covers this information. Write it.

UPDATE - A related fact exists but the new information supersedes it. The user previously worked at Company A; now they have moved to Company B. The old fact gets replaced, not preserved alongside the new one.

DELETE - The existing fact is no longer true based on the new information. Remove it.

NOOP - The new information duplicates something already in the store. Take no action.

This is why interference does not compound over time. In a naive append-only store, every update adds a new fact alongside old ones. Eventually the store contains "user works at Company A" and "user works at Company B" as simultaneous true statements. Retrieval has to resolve that contradiction at query time - and often cannot. The four-operation update pipeline resolves it at write time instead. The store stays coherent as it grows rather than becoming noisier.
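The write-time resolution can be sketched as a dispatch over the four operations. Here the "related fact" check is a naive subject-key match; in a real pipeline an LLM or embedding comparison makes that call, and the names below are illustrative:

```python
def apply_update(store: dict, subject: str, new_value) -> str:
    """Resolve one candidate fact against the store; return the operation taken.

    store maps a subject key (e.g. "employer") to its current value.
    new_value=None signals the fact is no longer true.
    """
    current = store.get(subject)
    if new_value is None:
        if current is not None:
            del store[subject]          # DELETE: fact no longer true
            return "DELETE"
        return "NOOP"
    if current is None:
        store[subject] = new_value      # ADD: nothing covers this yet
        return "ADD"
    if current == new_value:
        return "NOOP"                   # duplicate: take no action
    store[subject] = new_value          # UPDATE: new info supersedes old
    return "UPDATE"
```

Running the Company A / Company B example through this: the second write replaces the first rather than sitting alongside it, which is exactly the property an append-only store lacks.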

Memory Types: What Gets Stored

Not all memory content is the same kind of thing, and treating it as undifferentiated text is one of the more common design mistakes in production agent memory.

Semantic memory is factual knowledge: what the user does, what tools they use, what their domain is, what preferences they have expressed. This is the primary candidate for the user memory layer. It is declarative, updateable, and directly useful for personalizing future responses.

Episodic memory is event-specific: what happened in a particular conversation, what decisions were made during a specific project, what a user tried that did not work. Episodic memory is useful for continuity ("you mentioned last week that you tried X") but decays in relevance faster than semantic memory. It often belongs in session memory rather than user memory.

Procedural memory covers how things are done: workflows a user follows, coding patterns they prefer, communication formats that work for them. Procedural memory influences behavior rather than knowledge - it shapes how an agent responds rather than what it knows. It is underused in current agent systems and disproportionately valuable when captured correctly.

Working memory is the active context - the content of the current context window. The short-term memory guide covers how this layer works and how it interacts with retrieval from longer-term stores.

Understanding which type of memory you are storing changes where it should live and how long it should be retained. Semantic facts about a user stay relevant for months. Details of a specific debugging session may be irrelevant after the week ends.
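That mapping from memory type to retention can be made explicit in configuration. The windows below are illustrative defaults chosen for the sketch, not recommendations from any benchmark:

```python
from datetime import timedelta

# Illustrative retention windows per memory type (values are assumptions).
RETENTION = {
    "semantic": timedelta(days=180),    # user facts stay relevant for months
    "episodic": timedelta(days=14),     # event detail decays quickly
    "procedural": timedelta(days=365),  # learned workflows are long-lived
    "working": timedelta(0),            # lives only in the context window
}

def is_expired(memory_type: str, age: timedelta) -> bool:
    """True if a memory of this type has outlived its retention window."""
    return age > RETENTION[memory_type]
```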

The Retrieval Problem

Storing memories is only half the problem. The more common failure mode is retrieval that returns the wrong things, or too many things, or nothing at all when something relevant exists.

The retrieval approach used in most implementations is vector similarity search: embed the query, find the nearest stored memories by embedding distance. This works for straightforward cases but has documented weaknesses. It is sensitive to phrasing - a query about "data privacy" may not retrieve a stored fact about "HIPAA compliance" depending on how the embeddings were constructed. It has no concept of time - a stale preference from two years ago ranks equally with something stated last week. And it has no concept of contradiction - it returns whatever is most similar regardless of whether those results are mutually consistent.

Mem0g, the graph-enhanced variant of Mem0's memory system, addresses the structural retrieval problem by storing memories as a directed, labeled graph rather than a flat vector store. Entities are nodes. Relationships between entities are edges with labels. A stored fact about a user's employer becomes a node for the user, a node for the company, and an edge labeled "works at" connecting them.

When retrieval happens, it can traverse the graph structure rather than just measuring embedding distance. If you know a user works at a healthcare company, and the current query is about data handling requirements, the graph can follow the relationship chain from user to employer to industry to relevant regulatory context. Flat vector retrieval cannot do this - it can only find embeddings that are close to the query vector.
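The traversal itself can be sketched as a breadth-first walk over labeled edges. This is a generic graph walk mirroring the healthcare example above, not Mem0g's implementation:

```python
from collections import deque

# Edges as (source, label, target) triples; data is illustrative.
EDGES = [
    ("user:alice", "works_at", "org:healthco"),
    ("org:healthco", "in_industry", "industry:healthcare"),
    ("industry:healthcare", "regulated_by", "reg:hipaa"),
]

def traverse(start: str, max_hops: int = 3) -> list:
    """Follow relationship chains outward from a node, up to max_hops away."""
    adjacency = {}
    for src, _label, dst in EDGES:
        adjacency.setdefault(src, []).append(dst)
    seen, frontier, found = {start}, deque([(start, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return found
```

Starting from the user node, the walk surfaces the employer, the industry, and the regulatory context in three hops - entities a flat vector search would only find if their embeddings happened to sit near the query.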

The benchmark numbers reflect this. On the LOCOMO evaluation, Mem0g scored 68.4% versus Mem0's 66.9%, at a p95 latency of 2.59 seconds - closer to full-context accuracy while staying near memory-based latency. The accuracy improvement comes from the relational traversal. The latency stays low because the graph is queried selectively, not scanned in full.

The graph memory guide covers the implementation in detail.

Memory Scoping for Multi-User and Multi-Agent Systems

A memory system that works correctly for a single user becomes a liability in production when scope is not managed carefully. Mem0's architecture supports five scoping dimensions:

  • user_id - Memory specific to an individual user. The most common scope. Every memory tied to a user_id is private to that user and returned only when queries use the same user_id.

  • session_id - Memory for a specific session or task sequence. Useful for maintaining continuity within a project sprint or support case without permanently writing session-specific context to the user layer.

  • agent_id - Memory associated with a specific agent instance. Relevant when you have multiple specialized agents in the same deployment - a code review agent and a documentation agent might both interact with the same user but should maintain separate learned context.

  • run_id - Memory scoped to a single execution or pipeline run. Useful for batch processing where you need to track what happened in a specific run without polluting the user's persistent memory.

  • org_id - Organizational memory shared across all users and agents in a deployment. The right scope for shared knowledge bases, company policies, or team-wide context that every agent instance should have access to.

Scopes can be combined. A query with both user_id and org_id will retrieve memories specific to that user plus any org-level context that is relevant - a common pattern in enterprise deployments where agents need to blend personal context with company knowledge.
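Combined-scope retrieval is roughly a filtered union. The field names below follow the scoping dimensions above, but the search helper itself is an illustrative sketch, not Mem0's query implementation:

```python
def search_scoped(memories: list, user_id: str, org_id: str) -> list:
    """Return memories private to this user plus shared org-level context.

    Each memory dict carries the scope fields it was written with: a memory
    with only org_id set is org-wide; one with user_id set is private.
    """
    return [
        m for m in memories
        if m.get("user_id") == user_id
        or (m.get("org_id") == org_id and m.get("user_id") is None)
    ]
```

Note the guard on `user_id is None` in the org branch: without it, one user's private memories would leak to everyone in the org - exactly the over-broad scoping failure described below.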

Getting scoping wrong creates two failure modes. Over-broad scoping bleeds memory between users or contaminates the org layer with individual user data. Under-broad scoping means the agent cannot access memories it should be using, which looks like poor retention even when the data is correctly stored.

The multi-agent memory systems guide covers how to structure scopes for complex agent architectures with multiple specialized agents operating on shared data.

Forgetting as a Design Requirement

One of the counterintuitive requirements of a good memory management system is that it must forget things.

Robert Bjork's "New Theory of Disuse" from cognitive psychology provides the theoretical basis: forgetting is not a passive failure of biological memory. It is an active, adaptive process that protects retrieval quality. Information that is rarely accessed correctly loses retrieval strength over time - not because the memory is gone, but because the system has learned it is unlikely to be useful. This keeps frequently accessed information surfaceable without burying it under a growing pile of stale facts.

For AI memory systems, the analogy is direct. A store that never removes anything becomes harder to use as it grows. User preferences from two years ago that are no longer relevant compete with current preferences in retrieval. Old project contexts that have been superseded add noise to every related query. The interference accumulates until retrieval quality degrades measurably.

Dynamic forgetting in a production memory system means: memories have relevance scores that decay if they are not reinforced by new interactions, and entries that fall below a threshold are pruned from the active store. This is not the same as deleting important information. High-relevance, frequently-accessed memories maintain their position. The decay only affects what the user has stopped reinforcing - which is often a reasonable proxy for what is no longer true or no longer relevant.

The implication for memory management architecture is that the store needs a lifecycle management layer - not just read and write operations, but a maintenance process that runs against existing entries to prune what is stale. Most production teams build this as a separate background job rather than part of the write path.
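A minimal sketch of that maintenance pass, assuming each memory tracks a relevance score and a last-access timestamp. The exponential half-life model and the specific constants are illustrative choices, not prescribed values:

```python
import math
from datetime import datetime, timedelta

HALF_LIFE = timedelta(days=30)   # assumption: relevance halves monthly without reinforcement
PRUNE_THRESHOLD = 0.1

def current_relevance(base_score: float, last_accessed: datetime, now: datetime) -> float:
    """Exponential decay since the memory was last reinforced."""
    elapsed = (now - last_accessed) / HALF_LIFE   # half-lives elapsed, as a float
    return base_score * math.pow(0.5, elapsed)

def prune(store: list, now: datetime) -> list:
    """Background maintenance pass: keep only memories above the threshold."""
    return [
        m for m in store
        if current_relevance(m["score"], m["last_accessed"], now) >= PRUNE_THRESHOLD
    ]
```

Reinforcement in this model is just resetting `last_accessed` (and optionally bumping the score) whenever a memory is retrieved and used, so frequently accessed entries never approach the threshold.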

The Case for Local and Privacy-First Memory

Not every deployment can send user memory data to a cloud API. Healthcare applications with PHI, legal applications with privileged information, and any enterprise operating under strict data residency requirements need memory management that runs locally.

In these cases, you can self-host Mem0, keeping all memory data on your own infrastructure. It extracts, deduplicates, and retrieves only what’s relevant at query time, while still integrating with tools like Claude, Cursor, and other MCP-compatible clients.

The tradeoff is infrastructure: local deployment means local resource management, local backup, and local reliability. For teams with compliance requirements that preclude cloud memory storage, it is a necessary tradeoff. For most development use cases and consumer applications, the cloud-hosted option is the lower-overhead path.

Integration Patterns in Production

The memory management system does not operate in isolation. It needs to integrate with whatever framework is running the agent - and the integration pattern affects how memory operations fit into the agent's request/response cycle.

For frameworks like LangChain and LangGraph, Mem0 integrates as a retriever and a storage backend. The agent can call memory search at the start of each turn to surface relevant context, and queue memory write operations at the end to capture anything worth persisting. In LangGraph specifically, memory operations can be embedded as nodes in the graph - a retrieve node at the beginning of the workflow, a store node at the end - which keeps the memory logic explicit in the workflow definition rather than buried in the agent's system prompt.

For Mastra, the integration exposes two tools directly to the agent: a Mem0-remember tool for retrieval and a Mem0-memorize tool for storage. The agent's own reasoning drives when each tool fires. The memorize tool saves asynchronously - the write happens in the background without blocking the agent's response, which keeps latency from compounding on write-heavy sessions. The remember tool uses semantic search, so the agent does not need to know exactly what it is looking for - it can pass the current user question and the retrieval system finds the relevant stored context.

The key architectural choice across all integration patterns is whether memory operations are agent-driven (the agent decides when to read and write) or pipeline-driven (memory retrieval and storage happen automatically at defined points in every request cycle). Agent-driven memory is more flexible and avoids unnecessary memory operations on queries where history is not relevant. Pipeline-driven memory is more consistent - it does not depend on the agent's reasoning to correctly identify memory-worthy moments.

For most production applications, a hybrid approach works best: automatic retrieval at the start of each request (so the agent always has relevant context), and agent-driven storage (so the agent decides what from the current turn is worth keeping).
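The hybrid pattern can be sketched as a request wrapper: the read is unconditional, the write is gated on the agent's own decision. `memory.search` and `memory.add` mirror Mem0's client methods, but the wrapper and the agent interface (`respond`, `wants_to_remember`) are illustrative assumptions:

```python
def handle_turn(agent, memory, user_id: str, message: str) -> str:
    # Pipeline-driven read: always surface relevant context before the model runs.
    context = memory.search(query=message, user_id=user_id)

    reply = agent.respond(message, context=context)

    # Agent-driven write: persist only if the agent flagged something durable.
    if agent.wants_to_remember():
        memory.add(
            messages=[
                {"role": "user", "content": message},
                {"role": "assistant", "content": reply},
            ],
            user_id=user_id,
        )
    return reply
```

In a latency-sensitive deployment the `memory.add` call would typically be dispatched asynchronously, as in the Mastra pattern above, so the write never blocks the response.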

What the Numbers Actually Say

Three metrics matter for production memory management: accuracy, latency, and token efficiency. The LOCOMO benchmark evaluation measured all three across the major approaches. The full results from Mem0's published research:

  • Full-context (send everything every time): 72.9% accuracy, 9.87s median latency, 17.12s p95, ~26,000 tokens per conversation.

  • Mem0 (vector-based selective retrieval): 66.9% accuracy, 0.71s median latency, 1.44s p95, ~1,800 tokens per conversation.

  • Mem0g (graph-enhanced retrieval): 68.4% accuracy, 1.18s median latency, 2.59s p95, ~1,800 tokens per conversation.

  • OpenAI Memory: 52.9% accuracy.

  • ReadAgent: 46.4% accuracy.

  • MemoryBank: 31.3% accuracy.

  • A-Mem: 68.6% accuracy.

  • LangMem: 50.9% accuracy.

The accuracy gap between full-context and Mem0g is 4.5 percentage points. The latency gap is roughly 8x at median and 6.6x at p95. The token gap is roughly 14x. The question every production team has to answer is whether that 4.5 point accuracy premium is worth the latency and cost penalty - and whether a 17-second p95 response time is acceptable in their actual product context.

For voice agents, the latency issue is decisive: a 17-second wait on one in twenty responses means a dead line in a live call. For coding copilots where users are actively waiting for responses, the friction compounds across every slow interaction. For customer support applications at volume, the token cost difference is the unit economics question.

For batch processing, background analysis, or async summarization tasks where response time is not user-facing, full-context approaches become more viable. The right answer is workload-dependent, but for any interactive, real-time agent, the memory-based approach is the production-viable path.

The research page has the full benchmark methodology and comparison tables.

Building the Right Memory Architecture

The practical sequence for teams adding memory management to an existing agent system:

Start with scope definition. Identify which memory layers you actually need. Most agents need user memory and conversation memory at minimum. Session memory matters if users work in extended task sequences. Org memory matters if shared knowledge needs to be consistent across users.

Choose extraction over summarization. Summarization compresses conversation history but keeps it as undifferentiated text. Extraction pulls discrete facts and stores them as searchable, updateable units. Extraction has higher upfront processing cost and pays back in retrieval quality, accuracy, and the ability to update individual facts without reprocessing entire conversation threads. The LLM chat history guide covers this trade-off in detail.

Build the update pipeline, not just the storage. The four-operation pipeline (ADD, UPDATE, DELETE, NOOP) is what keeps the store accurate over time. An append-only store that never updates or removes anything will degrade in retrieval quality as it grows.

Implement lifecycle management. Build in memory decay for low-relevance entries. This is not optional for long-running production systems - it is what keeps retrieval quality from degrading as the store ages.

Get scoping right before going multi-user. user_id isolation is the minimum. If you have multiple agents or organizational contexts, define the scoping model before the first deployment rather than retrofitting it after.

Measure latency and accuracy separately. Retrieval accuracy and response latency pull in different directions as context grows. Benchmark both for your specific workload rather than relying on general numbers. What matters is how the system performs on the queries your users actually ask.

The AI agent memory guide and the long-term memory guide cover the implementation side of these decisions in detail.

Memory management is where the gap between a stateless LLM and a useful agent actually closes. The model capability is table stakes. What makes an agent worth using over time is whether it remembers - accurately, selectively, and without degrading as the relationship grows.
