Memory Systems
A long-running agent accumulates context fast: tool outputs, intermediate reasoning, prior turns. The context window — even at 1M tokens — fills up. Quality degrades. Memory systems manage this.
Three kinds of memory
A useful model (borrowed from cognitive science):
- Working memory: the current context window. Active, short-term.
- Episodic memory: specific events from past interactions (“the user said X yesterday”).
- Semantic memory: facts/knowledge (“the user prefers concise responses”).
A complete agent has all three.
Working memory: the context window
What’s in it right now:
- System prompt
- Tool definitions
- Conversation history
- Recently retrieved context
- Latest tool results
When it fills up:
Compaction / summarization
Periodically, replace old turns with a summary:
Turn 1-15: <full content>
↓ summarize
[Earlier turns summary: User asked about pricing, agent searched docs, found pricing tier table, user agreed to plan B...]
Turn 16-20: <full content>
Keep recent turns verbatim; older turns become summaries. Common cutoff: when context exceeds 70% of capacity.
Tools to do this well:
- Anthropic’s compaction-aware tools.
- LangGraph’s memory abstractions.
- Custom logic per app.
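For the custom-logic case, a minimal sketch, assuming hypothetical count_tokens() and summarize() helpers for your tokenizer and a cheap summarization call:

def compact(messages, max_tokens, keep_recent=5):
    # count_tokens() and summarize() are stand-ins, not real library calls.
    if count_tokens(messages) < 0.7 * max_tokens:
        return messages  # under the ~70% cutoff, leave history alone

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # one LLM call over the older turns
    return [{"role": "user", "content": f"[Earlier turns summary: {summary}]"}] + recent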
Selective pruning
Not every tool result is worth keeping. Drop verbose outputs once their conclusions are extracted:
- Original: 5000-token web page text
- Pruned: “Page summary: … (full content available via tool get_url(url))”
The model can re-fetch if needed — usually it doesn’t need to.
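A sketch of the pruning step, assuming a tool-result object with content, tool_name, and args fields, plus the same hypothetical count_tokens() and summarize() helpers as above:

MAX_KEPT_TOKENS = 500  # illustrative threshold

def prune_tool_result(result):
    # Keep short results verbatim; replace long ones with a stub the
    # model can act on later by re-calling the original tool.
    if count_tokens(result.content) <= MAX_KEPT_TOKENS:
        return result.content
    return (
        f"[Pruned. Summary: {summarize(result.content)} "
        f"Full content available via {result.tool_name}({result.args})]"
    )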
Hierarchical context
For very long sessions, organize context as a hierarchy:
Top-level summary (always in context)
├── Session-level summary (always in context)
└── Turn-level details (selectively included)
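Assembling that hierarchy into a prompt might look like this sketch, where the memory object and its search_turns() helper are assumptions:

def build_context(memory, query, k=3):
    # The top-level and session summaries always ride along;
    # only the k most relevant turn-level details are pulled in.
    details = memory.search_turns(query, limit=k)  # assumed retrieval helper
    return "\n\n".join([
        f"Overall summary:\n{memory.top_summary}",
        f"Current session:\n{memory.session_summary}",
        "Relevant details:\n" + "\n".join(details),
    ])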
Episodic memory
Specific past events. Common implementations:
Persistent conversation history
Store every turn in a DB. Load relevant ones at the start of each session.
def load_context(user_id, current_query):
    # `db` is a hypothetical client with parameterized queries; the
    # `embedding <-> ?` clause stands in for your vector store's similarity search.
    relevant = db.query(
        "SELECT turns FROM history"
        " WHERE user_id = ? ORDER BY embedding <-> ? LIMIT 10",
        user_id, embed(current_query),
    )
    return relevant
Retrieval is RAG over conversation history.
Episodic markers
Tag specific events:
- “User completed onboarding on 2025-12-01”
- “User reported bug XYZ”
- “Agent escalated case 1234”
These act like indexable life events.
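Under the hood these can be plain timestamped rows; a sketch of the write side using SQLite, with an assumed schema:

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("memory.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events"
    " (user_id TEXT, kind TEXT, detail TEXT, occurred_at TEXT)"
)

def record_event(user_id, kind, detail):
    # e.g. record_event("u123", "onboarding_completed", "chose plan B")
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (user_id, kind, detail, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()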
Semantic memory
Facts and preferences accumulated about a user, project, or domain.
User profile / preferences
profile = {
"name": "Alex",
"preferences": {
"response_style": "concise",
"timezone": "America/New_York",
},
"facts": [
"uses Python primarily",
"manages team of 5",
"interested in security topics",
],
}
Surface this in the system prompt. Update via dedicated tools:
update_user_fact(fact: str)
remove_user_fact(fact_id: str)
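A sketch of what those tools might wrap, assuming a simple in-memory fact store (swap in a real per-user database):

import uuid

user_facts = {}  # fact_id -> fact text; illustrative only

def update_user_fact(fact: str) -> str:
    fact_id = str(uuid.uuid4())
    user_facts[fact_id] = fact
    return fact_id

def remove_user_fact(fact_id: str) -> None:
    user_facts.pop(fact_id, None)

def render_profile(profile) -> str:
    # Rendered into the system prompt at the start of each turn.
    prefs = ", ".join(f"{k}={v}" for k, v in profile["preferences"].items())
    return f"User: {profile['name']}. Preferences: {prefs}. Facts: {'; '.join(user_facts.values())}"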
Knowledge base
Project-, team-, or organization-specific knowledge. The agent both reads from and writes to it.
A note-taking interface for the agent:
save_note(category: str, content: str)
search_notes(query: str)
This is essentially RAG (Stage 09) with the agent on the write side, not just the read side.
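A sketch of those note tools backed by embedding search; embed() and normalized vectors are assumptions:

import numpy as np

notes = []  # list of (category, content, vector)

def save_note(category: str, content: str) -> None:
    notes.append((category, content, embed(content)))  # embed() assumed

def search_notes(query: str, k: int = 5):
    q = embed(query)
    scored = sorted(
        notes,
        key=lambda n: float(np.dot(n[2], q)),  # cosine score if vectors are normalized
        reverse=True,
    )
    return [f"[{category}] {text}" for category, text, _ in scored[:k]]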
Memory hygiene
Like any database, memory needs maintenance:
- Deduplication: avoid storing “user prefers concise” 50 times.
- Conflict resolution: when a new fact contradicts an old one, do you overwrite it or keep both?
- Decay: should preferences from a year ago still apply?
- Privacy: respect the user's right to be forgotten.
- Source tracking: where did this fact come from, and when?
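Deduplication in particular is cheap to bolt onto the write path; a sketch, again assuming an embed() helper that returns normalized vectors:

import numpy as np

def save_fact_if_new(fact: str, existing: list[str], threshold: float = 0.9) -> bool:
    # Skip the write when a near-duplicate fact is already stored.
    v = embed(fact)
    for old in existing:
        if float(np.dot(v, embed(old))) > threshold:
            return False
    existing.append(fact)
    return True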
When agents should write to memory
Patterns:
Always-on extractor
After every turn, an LLM extracts facts to save:
After each user turn, identify any new lasting preferences or facts. Save them with the save_user_fact tool.
Costs an extra LLM call per turn; produces clean memory.
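A sketch of that extractor step, with llm() standing in for your model client and save_user_fact being the tool named in the prompt above:

EXTRACT_PROMPT = (
    "After this user turn, list any new lasting preferences or facts, "
    "one per line. Reply with NONE if there are none.\n\nTurn:\n{turn}"
)

def extract_and_save(turn_text: str) -> None:
    # One extra LLM call per turn; llm() is an assumed helper returning text.
    reply = llm(EXTRACT_PROMPT.format(turn=turn_text))
    if reply.strip() != "NONE":
        for line in reply.splitlines():
            if line.strip():
                save_user_fact(line.strip())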
Explicit save tool
The agent decides when to save:
"I should remember that the user prefers Markdown over HTML." → save_user_fact(...)
Cheaper but relies on the model remembering to use it.
End-of-session reflection
At the end of a session, summarize what happened and save key facts. Cheaper, but loses real-time updates.
Long-term memory architectures
MemGPT / Letta
Treats the LLM context window the way an OS treats main memory: a fast but limited tier backed by larger external storage. The agent has explicit “memory tools”:
- Read/write to working memory.
- Read/write to long-term memory.
- Recall by query.
The model paginates through its own memory. Works for very long-running agents.
Memory tree (Anthropic-style “memory” feature)
Hierarchical memory: top-level concepts → subconcepts → details. Agent navigates the tree as needed.
Embedding-based + summary
Store every meaningful event with an embedding. Retrieve by relevance. Summarize at higher levels for fast access.
Multi-tenant memory
For B2B / multi-user agents:
- Strict per-tenant isolation. No leakage of memory across tenants.
- Per-user scoping inside a tenant.
- Audit trails: who wrote what, when.
This is mostly an engineering / database design problem, not an agent design problem.
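Concretely, the discipline is that every memory read and write carries the tenant scope; a sketch, with db.query() and the vector clause as stand-ins like the earlier retrieval example:

def search_memory(db, tenant_id: str, user_id: str, query: str):
    # Every query is scoped to tenant and user; nothing is fetched
    # without that scope, so memory cannot leak across tenants.
    return db.query(
        "SELECT content FROM memories"
        " WHERE tenant_id = ? AND user_id = ?"
        " ORDER BY embedding <-> ? LIMIT 10",
        tenant_id, user_id, embed(query),
    )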
Pitfalls
- Memory bloat: agents save everything, retrieval slows, signal-to-noise drops.
- Stale memory: a year-old “user prefers X” no longer applies.
- Confidently wrong memory: a fact saved during a misunderstanding becomes ground truth forever.
- Memory as a crutch: the agent saves “I tried this approach” notes instead of actually solving the problem.
- Forgotten cleanup: tools to remove memories are as important as tools to add them.
Evaluating memory
Measure:
- Recall: when relevant memory exists, does the agent retrieve it?
- Precision: when retrieving, does it surface the right items?
- Update correctness: do new facts override old correctly?
- Resilience: does the agent perform well across many turns / sessions?
For long-running agents, “memory eval” is a separate axis from regular evals. Build a synthetic multi-session test where critical facts arrive in session 1 and must be used in session 5.
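A sketch of such a test, with run_session() as an assumed harness that runs one full session and returns the agent's final reply:

def test_cross_session_recall(agent):
    # Session 1: plant the critical fact.
    run_session(agent, user_id="eval-1",
                turns=["My deploy target is us-east-2, remember that."])
    # Sessions 2-4: unrelated filler to age the memory.
    for i in range(3):
        run_session(agent, user_id="eval-1", turns=[f"Unrelated question {i}"])
    # Session 5: the fact must surface without being restated.
    reply = run_session(agent, user_id="eval-1",
                        turns=["Which region should we deploy to?"])
    assert "us-east-2" in reply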
Don’t over-engineer
For most agents, you don’t need all three memory types from day one. Default progression:
1. Working memory only: stateless agent, every conversation fresh.
2. + conversation history: persist last N turns.
3. + summarization: compact history when long.
4. + user profile: explicit facts.
5. + semantic search over history: retrieve any past turn.
6. + structured episodic memory: explicit events with timestamps.
7. + hierarchical / tree memory: for very long-lived agents.
Most production agents stop at step 4 or 5.