Miscellaneous

Miscellaneous

Context Engineering in Multi-Turn AI Agents

Building a multi-turn AI agent that stays coherent across a long conversation is an engineering problem, not a prompting problem. The model itself is stateless, with no memory between calls.

What creates the illusion of a continuous conversation is the context window: the array of messages passed to the model on every single request. Context engineering is the deliberate practice of deciding what goes into that array and how it is shaped.


What Context Engineering Actually Means

Context engineering is a discipline distinct from prompt engineering. Prompt engineering focuses on the wording and structure of a single instruction. Context engineering focuses on the full information environment the model sees across an entire session: which messages to include, which to drop, how to compress history, how much budget to reserve for different components, and how to sequence retrieved information.

For a single-turn chatbot, context engineering barely matters. The model gets a system prompt and a user message and replies. For a multi-turn agent, one that executes tools, accumulates results, makes decisions, and builds on prior exchanges, context engineering is the core architectural concern. Every design choice about what the model sees at step N affects what it can reason about at step N+1.


The Context Window as Working Memory

Every call to an LLM takes a list of messages and returns a reply. That list is the model's working memory for that call. It sees everything in the list and nothing outside it. The moment the API call completes, nothing is retained. On the next call, whatever the developer passes in the messages array is all the model knows.

Context windows are bounded. Claude Sonnet 4.6 supports up to 1 million tokens. In practice, bounded does not mean unlimited: longer contexts cost more per call, increase latency, and beyond a certain density, models attend less precisely to material buried in the middle of a very long context. The engineering problem is not just fitting within the limit; it is staying within a practical budget while preserving the information the model needs.

The components competing for that budget in a typical multi-turn agent are:

  • System prompt: role definition, instructions, project context, tool descriptions

  • Conversation history: prior user messages and assistant replies

  • Tool results: outputs from tool calls the agent made during the session

  • Retrieved context: external documents or data injected for this turn

  • Current user input: the message being processed right now

Each component has a different profile. The system prompt is fixed. Conversation history grows turn by turn. Tool results can spike unpredictably; a file read or API response can return thousands of tokens in a single call. Developers who do not actively manage these competing demands run into context budget problems quickly.


How Context Accumulates in Multi-Turn Agents

A multi-turn agent does not just exchange text. It calls tools, receives results, reasons over them, calls more tools, and continues. Every one of these steps adds tokens to the message array passed on the next call.

Consider a simple research agent:

  • Turn 1: User asks a question. Agent calls WebSearch. Result: 2,000 tokens.

  • Turn 2: Agent reads a document with Read. Result: 4,000 tokens.

  • Turn 3: Agent calls another search. Another 2,000 tokens.

  • Turn 4: User asks a follow-up. Agent needs to synthesize across all prior context.

By turn 4, the agent is passing roughly 10,000 tokens of accumulated state before the user's follow-up question even arrives. Over a long session, say 30 turns with regular tool use, this compounds significantly. The developer who did nothing to manage growth is now working with a very long, very expensive context that contains a lot of information the model no longer needs.

The naive implementation of appending every message and every tool result to a growing list works until it does not. It fails in predictable ways: budget exhaustion, latency spikes, and degraded coherence from the model attending across too much irrelevant early history.


Core Context Engineering Techniques

There are several established techniques for managing context growth in multi-turn agents. Most production systems combine more than one.

Sliding window truncation is the simplest approach. Keep only the N most recent messages. Older exchanges are dropped. This keeps the context bounded but sacrifices continuity: if the user referenced something from turn 3 and the agent is now on turn 20, that reference is gone. Truncation works well for agents where recency dominates: customer support bots, command-line assistants, short-lived task agents.

Conversation summarization compresses older portions of the conversation into a single summary block. When the context reaches a threshold, the oldest K turns are replaced with a summary generated by the model itself. The summary block then anchors the conversation for subsequent turns. This preserves more semantic content than truncation but introduces lossy compression, where fine-grained details from the original exchanges may not survive the summary.

Selective tool output pruning addresses the problem of large, noisy tool results. A search result that returns 5,000 tokens often contains 200 tokens of relevant signal. Before appending the result to the context, extract or truncate it to just the relevant portion. This requires knowing what the current task needs, which can itself be a model call, but the token savings are often substantial.

Token budget management is the discipline of explicitly allocating the context budget across components before a session begins. Reserve a fixed allocation for the system prompt. Reserve a range for tool descriptions. Set a ceiling on accumulated history. Leave enough headroom for tool results and the current user message. When any component threatens to exceed its allocation, trigger the appropriate compression strategy.

Retrieval-augmented context injection replaces static history with dynamic relevance. Rather than keeping all prior exchanges in context, store them in a retrieval index and pull only the ones relevant to the current user message. The context window at each turn contains the current message, the retrieved history, the system prompt, and recent turns, not the full session. This keeps the window bounded regardless of conversation length.

Token Budget in Practice

A practical token budget for a mid-size agent on Claude Sonnet 4.6 (1M token context, but practical budget ~40K per call for latency and cost reasons) might look like:

  • System prompt and tool descriptions: 3,000–5,000 tokens

  • Recent conversation history (last 10 turns): 6,000–10,000 tokens

  • Retrieved or injected external context: 10,000–15,000 tokens

  • Tool result for current turn: 5,000–8,000 tokens

  • Current user input and output headroom: 5,000–10,000 tokens

The numbers are not universal; they depend on the agent's task, tool verbosity, and tolerance for latency. The discipline is having the budget at all. Agents without explicit budget management tend to grow unbounded until they hit a wall.


What Context Engineering Cannot Do

Context engineering solves the within-session problem. It cannot solve the across-session problem.

Every context engineering technique (truncation, summarization, budget management, retrieval) operates on the messages that exist within the current process. When the session ends, the context window ceases to exist. The next session starts from scratch: a fresh system prompt, no history, no memory of prior interactions.

This creates a concrete ceiling. A user spends a long session working with an agent to solve a complex problem. They come back tomorrow to continue. The agent has no knowledge of yesterday's session. The user repeats context the agent already processed. The agent re-discovers facts it reasoned about the day before.

For single-session agents, such as a one-shot task or a disposable pipeline run, this is not a problem. For agents intended to build a relationship with a user over time, or to accumulate knowledge across many sessions, it is a fundamental limitation that context engineering does not address.

There is also a user-identity problem. Context engineering treats the context window as a single, uniform space. It has no concept of multiple users sharing the same agent, or of a user-specific history that should persist across sessions. CLAUDE.md can inject project-level configuration, but it applies equally to every user and every session; it is not a mechanism for per-user memory.


Persistent Memory as the Next Layer

The limitation of in-context state calls for a different architectural layer: external persistent memory that survives the process boundary. The idea is to extract what is worth remembering from each session and store it in a retrieval system, then inject relevant stored memories at the start of the next session before the agent begins work.

This is distinct from context engineering. Context engineering manages what is in the window during a session. Persistent memory manages what carries over between sessions. They operate at different time scales and solve different problems. Production multi-turn agents typically need both.

The retrieval pattern for persistent memory mirrors the retrieval-augmented context injection technique described earlier, but the source is user history rather than documents. At session start: query the memory store with the current user's prompt and inject the top results. At session end: extract the meaningful exchange from this session and write it to the memory store. Over time, each user accumulates a memory profile that makes each subsequent session richer.


Where Mem0 Fits

Mem0 provides the persistent memory layer that sits outside the context window and extends across sessions. At the end of each agent turn, relevant exchanges are stored with the user's identifier. At the start of the next turn, the most relevant memories are retrieved and injected into context before the agent processes the user's message.

from mem0 import MemoryClient

client = MemoryClient(api_key="your-api-key")

# At session start: retrieve what is relevant to this user and task
memories = client.search(user_message, filters={"user_id": user_id})
memory_context = "\\n".join([m["memory"] for m in memories])

# Inject into the agent's context before the session begins
system_prompt = base_system_prompt
if memory_context:
    system_prompt += f"\\n\\nContext from prior sessions:\\n{memory_context}"

# At session end: store what happened
client.add(
    [{"role": "user", "content": user_message},
     {"role": "assistant", "content": agent_reply}]

from mem0 import MemoryClient

client = MemoryClient(api_key="your-api-key")

# At session start: retrieve what is relevant to this user and task
memories = client.search(user_message, filters={"user_id": user_id})
memory_context = "\\n".join([m["memory"] for m in memories])

# Inject into the agent's context before the session begins
system_prompt = base_system_prompt
if memory_context:
    system_prompt += f"\\n\\nContext from prior sessions:\\n{memory_context}"

# At session end: store what happened
client.add(
    [{"role": "user", "content": user_message},
     {"role": "assistant", "content": agent_reply}]

from mem0 import MemoryClient

client = MemoryClient(api_key="your-api-key")

# At session start: retrieve what is relevant to this user and task
memories = client.search(user_message, filters={"user_id": user_id})
memory_context = "\\n".join([m["memory"] for m in memories])

# Inject into the agent's context before the session begins
system_prompt = base_system_prompt
if memory_context:
    system_prompt += f"\\n\\nContext from prior sessions:\\n{memory_context}"

# At session end: store what happened
client.add(
    [{"role": "user", "content": user_message},
     {"role": "assistant", "content": agent_reply}]

Mem0's semantic retrieval means the injected memories are relevant to the current task, not a dump of everything ever said. The context window stays within budget. User identity is handled via user_id, so multiple users on the same agent accumulate separate memory profiles. Context engineering techniques (window management, summarization, budget allocation) continue to apply within the session. Mem0 handles what survives between them.


Final Notes

Context engineering is the practice of deliberately shaping the information environment the model works within: selecting, compressing, budgeting, and sequencing what goes into the context window across a multi-turn session. It is a necessary discipline for any agent that accumulates state within a session. It is not sufficient for agents that need to carry state across sessions. Persistent memory is the adjacent layer that fills that gap, extending the agent's effective memory beyond the boundary of a single process.

Mem0 is an intelligent, open-source memory layer designed for LLMs and AI agents to provide long-term, personalized, and context-aware interactions across sessions.

GET TLDR from:

Summarize

Website/Footer

Summarize

Website/Footer

Summarize

Website/Footer

Summarize

Website/Footer