Long Context
In 2020, a 2k-token context felt generous. By 2026, 1M-token contexts are common and 10M-token research models exist. Long context changes the design of LLM systems: less RAG, more direct reading, but with new failure modes.
What “long context” enables
- Whole-codebase reasoning: load 100k–1M lines of code, ask questions.
- Whole-book summarization: a novel fits in 200k tokens.
- Conversation history: weeks of chat history without truncation.
- Document-grounded QA without RAG: drop the corpus in, ask.
- Multi-document synthesis: 50 PDFs in one prompt.
What it doesn’t fix
- Retrieval quality: even with a 1M-token window, you still need good chunking and ranking once the corpus runs to 100M tokens.
- Attention “needle in a haystack”: models retrieve well in some positions and worse in others.
- Latency: 1M-token prompts are slow to process.
- Cost: 1M-token prompts are expensive.
- Hallucination: a longer context gives the model more irrelevant material to latch onto and confabulate from.
Long context complements RAG; it doesn’t replace it. The question is what to put in context at any given step.
How it’s implemented
The naive transformer is O(T²) in sequence length: at T = 1M, that is on the order of 10¹² attention scores per head per layer, infeasible to compute or store directly. Several layers of tricks make long context workable:
Architectural
- GQA / MQA (Stage 06): reduces KV cache size.
- Sparse / local attention: tokens only attend to a window or pattern.
- Sliding window attention: each token attends to the last W tokens (see the mask sketch after this list). Mistral uses a 4k window with shared global tokens.
- Hybrid models: mix attention with state-space models (Mamba, RWKV) for long-range modeling without O(T²).
- Multi-head latent attention (MLA): DeepSeek’s compression of K/V to a low-rank latent.
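A minimal sketch of the causal sliding-window mask, in NumPy; the shared global tokens are omitted, and `W` is the window size:

```python
import numpy as np

def sliding_window_mask(T: int, W: int) -> np.ndarray:
    """True where attention is allowed: query i may attend to key j
    only if j is among the last W positions up to and including i."""
    i = np.arange(T)[:, None]  # query positions
    j = np.arange(T)[None, :]  # key positions
    return (j <= i) & (j > i - W)
```

Stacking L such layers gives an effective receptive field of roughly L × W tokens, which is how a 4k window can still carry information across much longer sequences.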
Positional
- RoPE with high base (1M+): supports longer contexts directly.
- YaRN, NTK-aware scaling, LongRoPE: extend RoPE-trained models post-hoc.
- Position interpolation: smooth scaling of positional encodings (sketched after this list).
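The position-interpolation trick, sketched for RoPE under the standard inverse-frequency parameterization (function and argument names here are illustrative):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, train_len=None, target_len=None):
    """RoPE rotation angles. Position interpolation rescales positions by
    train_len / target_len so that an extended context maps back into the
    position range the model was trained on."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    if train_len and target_len and target_len > train_len:
        positions = np.asarray(positions) * (train_len / target_len)
    return np.outer(positions, inv_freq)  # shape (T, dim // 2)
```

NTK-aware scaling and YaRN instead adjust `base` (non-uniformly across frequency bands, in YaRN's case), which preserves high-frequency positional detail better than uniform interpolation.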
Training
- Curriculum: start with short contexts and lengthen progressively (see the schedule sketched after this list).
- Continued pretraining at long context with a small high-quality long-context dataset.
- Long-context RL: tune for retrieval and reasoning at long contexts.
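In practice the curriculum is just a length schedule over training steps; a hypothetical example (the milestone numbers are made up for illustration, not taken from any published recipe):

```python
# (training step, max sequence length) milestones, ascending by step.
SCHEDULE = [(0, 4_096), (50_000, 32_768), (80_000, 131_072), (95_000, 1_048_576)]

def max_seq_len(step: int) -> int:
    """Max context length to pack batches to at a given training step."""
    length = SCHEDULE[0][1]
    for milestone, seq_len in SCHEDULE:
        if step >= milestone:
            length = seq_len
    return length
```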
Inference
- FlashAttention v2/v3: tile-based attention, no full T×T matrix in memory.
- Paged KV cache: allocate KV memory in pages, like virtual memory.
- Prefix caching: reuse computed KV for shared prompt prefixes (see the sketch after this list).
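A toy version of prefix caching; real servers such as vLLM hash and reuse KV at page granularity rather than whole prefixes, and `compute_kv` here is an assumed prefill function:

```python
import hashlib

class PrefixCache:
    """Memoize KV tensors for a shared prompt prefix so repeated queries
    over the same document skip the expensive prefill."""

    def __init__(self):
        self._store = {}

    def get_or_prefill(self, prefix_token_ids, compute_kv):
        key = hashlib.sha256(repr(prefix_token_ids).encode()).hexdigest()
        if key not in self._store:
            self._store[key] = compute_kv(prefix_token_ids)  # full prefill, once
        return self._store[key]  # subsequent calls reuse the cached KV
```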
The “needle in a haystack” benchmark
A classic long-context test: hide a short piece of information (“The best place to eat in San Francisco is Tartine”) in a giant block of unrelated text, then ask a question about it.
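Generating such a test takes a few lines; a sketch (word counts stand in for tokens, and all names are illustrative):

```python
def make_needle_test(filler_text: str, needle: str, depth: float,
                     context_words: int) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)
    inside a haystack of `context_words` words of filler."""
    words = filler_text.split()[:context_words]
    pos = int(len(words) * depth)
    return " ".join(words[:pos] + [needle] + words[pos:])
```

Sweeping `depth` against context length produces the familiar position-by-length heatmap.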
Results vary by:
- Position (where in the haystack the needle lives).
- Topic (relevant vs. unrelated to surrounding text).
- Distractors (how similar the surrounding noise is).
Frontier models (Claude, Gemini, GPT-4.x) score nearly perfectly on simple haystacks at 200k+ tokens. Multi-needle and reasoning over long context are harder — many models degrade past 32k–128k tokens even when haystack tests look perfect.
Long-context evaluation
Beyond the haystack test:
- RULER (NVIDIA 2024): synthetic long-context tasks at varied lengths.
- InfiniteBench, ZeroSCROLLS, LongBench: real-world tasks (summarization, multi-doc QA).
- NoLiMa: needle-in-haystack with semantic (not exact-match) needles — much harder.
- OpenAI MRCR, Anthropic’s NIAH variants: vendor-specific evals.
For your own application, build a long-context eval that mirrors your data.
When to use long context vs RAG
Rules of thumb:
| Scenario | Long context | RAG | Notes |
|---|---|---|---|
| Corpus < 100k tokens | ✓ | | |
| Corpus < 1M tokens | possible | ✓ | |
| Corpus > 1M tokens | impractical | ✓ | |
| Need source citations | hard | easy | use RAG, even with a long-context model |
| Latency-critical | no | yes | |
| Cost-sensitive | no | yes | |
| Connecting information across docs | yes | needs care | RAG → reorder → long-context read |
The modern hybrid: retrieve a tighter set of context (e.g. top 50 chunks → ~50k tokens), then let the LLM reason over the full retrieved set. Better than top-5-only RAG, much cheaper than dumping the whole corpus.
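A sketch of that hybrid, treating `retriever` and `llm` as assumed interfaces rather than any particular library:

```python
def hybrid_answer(query: str, retriever, llm, top_k: int = 50) -> str:
    """Wide-recall retrieval, then one long-context call over the full set."""
    chunks = retriever.search(query, top_k=top_k)      # recall over precision
    chunks.sort(key=lambda c: (c.doc_id, c.position))  # restore document order
    context = "\n\n".join(c.text for c in chunks)      # ~50k tokens at k = 50
    return llm.complete(f"{context}\n\nQuestion: {query}\nAnswer:")
```

The reordering step matters: chunks fed back in document order give the model coherent passages instead of a relevance-shuffled jumble.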
Long-context cost model
For a 1M-token prompt:
- ~$5–$15 per query at frontier model prices (input tokens).
- ~5–60 second processing time.
- Reasoning models on top of long context: even more.
Caching helps:
- Prompt caching (Anthropic, OpenAI, Gemini): cache the static portion of a long prompt; cached input tokens cost roughly 10× less. If you query a 200k-token document repeatedly, only the new part is full-priced.
This makes “load a book and ask 50 questions” practical and cheap.
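A back-of-envelope model of that session (prices are placeholders, and the cache-write surcharges some providers apply are ignored):

```python
def session_cost(n_queries: int, doc_tokens: int = 200_000,
                 query_tokens: int = 500, price_per_mtok: float = 5.0,
                 cache_discount: float = 0.1) -> float:
    """Input-token cost, in dollars, of n_queries over one cached document."""
    first_prefill = doc_tokens / 1e6 * price_per_mtok
    cached_reads = (n_queries - 1) * doc_tokens / 1e6 * price_per_mtok * cache_discount
    fresh_queries = n_queries * query_tokens / 1e6 * price_per_mtok
    return first_prefill + cached_reads + fresh_queries

# 50 questions over a 200k-token book at $5/Mtok input:
# ≈ $6 with caching vs ≈ $50 without.
```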
Practical patterns
- Always cache when reusing context. Saves cost dramatically.
- Place critical info early or late in the prompt. Models attend better to those positions (the "lost in the middle" effect).
- Use RAG to filter before long context — relevance > capacity.
- Test your specific use case: synthetic benchmarks don't always predict real-world performance.
- Measure latency under your real distribution.
- Avoid mid-context distractors: text that resembles the target confuses models more than position alone does.
What’s next
- 10M+ token contexts with hybrid attention/SSM models (already research-grade).
- Continuous online context — agents that maintain context across days.
- Better long-context training data — synthetic generation pipelines for long-doc reasoning.
- Architectural breakthroughs — Mamba-2, JEPA-style alternatives, retrieval-transformer hybrids.