Memory Systems
A long-running agent accumulates context fast: tool outputs, intermediate reasoning, prior turns. The context window — even at 1M tokens — fills up. Quality degrades. Memory systems manage this.
Three kinds of memory
A useful model (borrowed from cognitive science):
- Working memory: the current context window. Active, short-term.
- Episodic memory: specific events from past interactions (“the user said X yesterday”).
- Semantic memory: facts/knowledge (“the user prefers concise responses”).
A complete agent has all three.
Working memory: the context window
What’s in it right now:
- System prompt
- Tool definitions
- Conversation history
- Recently retrieved context
- Latest tool results
When it fills up:
Compaction / summarization
Periodically, replace old turns with a summary:
Turn 1-15: <full content>
↓ summarize
[Earlier turns summary: User asked about pricing, agent searched docs, found pricing tier table, user agreed to plan B...]
Turn 16-20: <full content>
Keep recent turns verbatim; older turns become summaries. Common cutoff: when context exceeds 70% of capacity.
Tools to do this well:
- Anthropic’s compaction-aware tools.
- LangGraph’s memory abstractions.
- Custom logic per app.
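For the custom-logic case, a minimal sketch, assuming hypothetical count_tokens() and summarize() helpers for your tokenizer and a cheap summarization call:

def compact(messages, max_tokens, keep_recent=5):
    # count_tokens() and summarize() are stand-ins, not real library calls.
    if count_tokens(messages) < 0.7 * max_tokens:
        return messages  # under the ~70% cutoff, leave history alone

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # one LLM call over the older turns
    return [{"role": "user", "content": f"[Earlier turns summary: {summary}]"}] + recent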
Selective pruning
Not every tool result is worth keeping. Drop verbose outputs once their conclusions are extracted:
- Original: 5000-token web page text
- Pruned: “Page summary: … (full content available via tool get_url(url))”
The model can re-fetch if needed — usually it doesn’t need to.
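A sketch of the pruning step, assuming a tool-result object with content, tool_name, and args fields, plus the same hypothetical count_tokens() and summarize() helpers as above:

MAX_KEPT_TOKENS = 500  # illustrative threshold

def prune_tool_result(result):
    # Keep short results verbatim; replace long ones with a stub the
    # model can act on later by re-calling the original tool.
    if count_tokens(result.content) <= MAX_KEPT_TOKENS:
        return result.content
    return (
        f"[Pruned. Summary: {summarize(result.content)} "
        f"Full content available via {result.tool_name}({result.args})]"
    )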
Hierarchical context
For very long sessions, organize context as a hierarchy:
Top-level summary (always in context)
├── Session-level summary (always in context)
└── Turn-level details (selectively included)
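Assembling that hierarchy into a prompt might look like this sketch, where the memory object and its search_turns() helper are assumptions:

def build_context(memory, query, k=3):
    # The top-level and session summaries always ride along;
    # only the k most relevant turn-level details are pulled in.
    details = memory.search_turns(query, limit=k)  # assumed retrieval helper
    return "\n\n".join([
        f"Overall summary:\n{memory.top_summary}",
        f"Current session:\n{memory.session_summary}",
        "Relevant details:\n" + "\n".join(details),
    ])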
Episodic memory
Specific past events. Common implementations:
Persistent conversation history
Store every turn in a DB. Load relevant ones at the start of each session.
def load_context(user_id, current_query):
    # `db` is a hypothetical client with parameterized queries; the
    # `embedding <-> ?` clause stands in for your vector store's similarity search.
    relevant = db.query(
        "SELECT turns FROM history"
        " WHERE user_id = ? ORDER BY embedding <-> ? LIMIT 10",
        user_id, embed(current_query),
    )
    return relevant
Retrieval is RAG over conversation history.
Episodic markers
Tag specific events:
- “User completed onboarding on 2025-12-01”
- “User reported bug XYZ”
- “Agent escalated case 1234”
These act like indexable life events.
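Under the hood these can be plain timestamped rows; a sketch of the write side using SQLite, with an assumed schema:

import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("memory.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS events"
    " (user_id TEXT, kind TEXT, detail TEXT, occurred_at TEXT)"
)

def record_event(user_id, kind, detail):
    # e.g. record_event("u123", "onboarding_completed", "chose plan B")
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (user_id, kind, detail, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()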
Semantic memory
Facts and preferences accumulated about a user, project, or domain.
User profile / preferences
profile = {
"name": "Alex",
"preferences": {
"response_style": "concise",
"timezone": "America/New_York",
},
"facts": [
"uses Python primarily",
"manages team of 5",
"interested in security topics",
],
}
Surface this in the system prompt. Update via dedicated tools:
update_user_fact(fact: str)
remove_user_fact(fact_id: str)
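A sketch of what those tools might wrap, assuming a simple in-memory fact store (swap in a real per-user database):

import uuid

user_facts = {}  # fact_id -> fact text; illustrative only

def update_user_fact(fact: str) -> str:
    fact_id = str(uuid.uuid4())
    user_facts[fact_id] = fact
    return fact_id

def remove_user_fact(fact_id: str) -> None:
    user_facts.pop(fact_id, None)

def render_profile(profile) -> str:
    # Rendered into the system prompt at the start of each turn.
    prefs = ", ".join(f"{k}={v}" for k, v in profile["preferences"].items())
    return f"User: {profile['name']}. Preferences: {prefs}. Facts: {'; '.join(user_facts.values())}"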
Knowledge base
Project-, team-, or organization-specific knowledge. The agent both reads from and writes to it.
A note-taking interface for the agent:
save_note(category: str, content: str)
search_notes(query: str)
This is essentially RAG (Stage 09) with the agent on the write side, not just the read side.
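A sketch of those note tools backed by embedding search; embed() and normalized vectors are assumptions:

import numpy as np

notes = []  # list of (category, content, vector)

def save_note(category: str, content: str) -> None:
    notes.append((category, content, embed(content)))  # embed() assumed

def search_notes(query: str, k: int = 5):
    q = embed(query)
    scored = sorted(
        notes,
        key=lambda n: float(np.dot(n[2], q)),  # cosine score if vectors are normalized
        reverse=True,
    )
    return [f"[{category}] {text}" for category, text, _ in scored[:k]]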
Memory hygiene
Like any database, memory needs maintenance:
- Deduplication: avoid storing “user prefers concise” 50 times.
- Conflict resolution: when a new fact contradicts an old one, do you overwrite it or keep both?
- Decay: should preferences from a year ago still apply?
- Privacy: respect the user's right to be forgotten.
- Source tracking: where did this fact come from, and when?
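Deduplication in particular is cheap to bolt onto the write path; a sketch, again assuming an embed() helper that returns normalized vectors:

import numpy as np

def save_fact_if_new(fact: str, existing: list[str], threshold: float = 0.9) -> bool:
    # Skip the write when a near-duplicate fact is already stored.
    v = embed(fact)
    for old in existing:
        if float(np.dot(v, embed(old))) > threshold:
            return False
    existing.append(fact)
    return True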
When agents should write to memory
Patterns:
Always-on extractor
After every turn, an LLM extracts facts to save:
After each user turn, identify any new lasting preferences or facts. Save them with the save_user_fact tool.
Costs an extra LLM call per turn; produces clean memory.
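A sketch of that extractor step, with llm() standing in for your model client and save_user_fact being the tool named in the prompt above:

EXTRACT_PROMPT = (
    "After this user turn, list any new lasting preferences or facts, "
    "one per line. Reply with NONE if there are none.\n\nTurn:\n{turn}"
)

def extract_and_save(turn_text: str) -> None:
    # One extra LLM call per turn; llm() is an assumed helper returning text.
    reply = llm(EXTRACT_PROMPT.format(turn=turn_text))
    if reply.strip() != "NONE":
        for line in reply.splitlines():
            if line.strip():
                save_user_fact(line.strip())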
Explicit save tool
The agent decides when to save:
"I should remember that the user prefers Markdown over HTML." → save_user_fact(...)
Cheaper but relies on the model remembering to use it.
End-of-session reflection
At the end of a session, summarize what happened and save key facts. Cheaper, but loses real-time updates.
Long-term memory architectures
MemGPT / Letta
Treats the LLM context window the way an OS treats main memory: a fast but limited tier backed by larger external storage. The agent has explicit “memory tools”:
- Read/write to working memory.
- Read/write to long-term memory.
- Recall by query.
The model paginates through its own memory. Works for very long-running agents.
Memory tree (Anthropic-style “memory” feature)
Hierarchical memory: top-level concepts → subconcepts → details. Agent navigates the tree as needed.
Embedding-based + summary
Store every meaningful event with an embedding. Retrieve by relevance. Summarize at higher levels for fast access.
Multi-tenant memory
For B2B / multi-user agents:
- Strict per-tenant isolation. No leakage of memory across tenants.
- Per-user scoping inside a tenant.
- Audit trails: who wrote what, when.
This is mostly an engineering / database design problem, not an agent design problem.
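Concretely, the discipline is that every memory read and write carries the tenant scope; a sketch, with db.query() and the vector clause as stand-ins like the earlier retrieval example:

def search_memory(db, tenant_id: str, user_id: str, query: str):
    # Every query is scoped to tenant and user; nothing is fetched
    # without that scope, so memory cannot leak across tenants.
    return db.query(
        "SELECT content FROM memories"
        " WHERE tenant_id = ? AND user_id = ?"
        " ORDER BY embedding <-> ? LIMIT 10",
        tenant_id, user_id, embed(query),
    )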
Pitfalls
- Memory bloat: agents save everything, retrieval slows, signal-to-noise drops.
- Stale memory: a year-old “user prefers X” no longer applies.
- Confidently wrong memory: a fact saved during a misunderstanding becomes ground truth forever.
- Memory as a crutch: the agent saves “I tried this approach” notes instead of actually solving the problem.
- Forgotten cleanup: tools to remove memories are as important as tools to add them.
Evaluating memory
Measure:
- Recall: when relevant memory exists, does the agent retrieve it?
- Precision: when retrieving, does it surface the right items?
- Update correctness: do new facts override old correctly?
- Resilience: does the agent perform well across many turns / sessions?
For long-running agents, “memory eval” is a separate axis from regular evals. Build a synthetic multi-session test where critical facts arrive in session 1 and must be used in session 5.
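A sketch of such a test, with run_session() as an assumed harness that runs one full session and returns the agent's final reply:

def test_cross_session_recall(agent):
    # Session 1: plant the critical fact.
    run_session(agent, user_id="eval-1",
                turns=["My deploy target is us-east-2, remember that."])
    # Sessions 2-4: unrelated filler to age the memory.
    for i in range(3):
        run_session(agent, user_id="eval-1", turns=[f"Unrelated question {i}"])
    # Session 5: the fact must surface without being restated.
    reply = run_session(agent, user_id="eval-1",
                        turns=["Which region should we deploy to?"])
    assert "us-east-2" in reply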
Don’t over-engineer
For most agents, you don’t need all three memory types from day one. Default progression:
1. Working memory only: stateless agent, every conversation fresh.
2. + conversation history: persist last N turns.
3. + summarization: compact history when long.
4. + user profile: explicit facts.
5. + semantic search over history: retrieve any past turn.
6. + structured episodic memory: explicit events with timestamps.
7. + hierarchical / tree memory: for very long-lived agents.
Most production agents stop at step 4 or 5.