Long Context

In 2020, a 2k-token context felt generous. By 2026, 1M-token contexts are common and 10M-token research models exist. Long context changes the design of LLM systems: less RAG, more direct reading, but with new failure modes.

What “long context” enables

  • Whole-codebase reasoning: load 100k–1M lines of code, ask questions.
  • Whole-book summarization: a novel fits in 200k tokens.
  • Conversation history: weeks of chat history without truncation.
  • Document-grounded QA without RAG: drop the corpus in, ask.
  • Multi-document synthesis: 50 PDFs in one prompt.

What it doesn’t fix

  • Retrieval quality: even with 1M tokens, you still need good chunking/ranking if you have 100M.
  • Positional attention bias: models retrieve information placed at some positions well and at others poorly (the “lost in the middle” effect).
  • Latency: 1M-token prompts are slow to process.
  • Cost: 1M-token prompts are expensive.
  • Hallucination: longer context → more irrelevant material for the model to confabulate from.

Long context complements RAG; it doesn’t replace it. The question is what to put in context at any given step.

How it’s implemented

The naive transformer is O(T²) in sequence length; at T = 1M that is roughly 10¹² query–key pairs per attention head per layer, which is impractical to compute naively. Several layers of tricks:

Architectural

  • GQA / MQA (Stage 06): reduces KV cache size.
  • Sparse / local attention: tokens only attend to a window or pattern.
  • Sliding window attention: each token attends only to the last W tokens (sketched below). Mistral 7B uses a 4k window; Longformer-style variants add a few global tokens.
  • Hybrid models: mix attention with state-space models (Mamba, RWKV) for long-range without O(T²).
  • Multi-head latent attention (MLA): DeepSeek’s compression of K/V to a low-rank latent.
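
To make the sliding-window idea concrete, here is a minimal sketch (NumPy; the window size W is arbitrary) of the banded causal mask each token effectively sees. Real implementations fuse this into the attention kernel rather than materializing a T×T boolean matrix.

```python
import numpy as np

def sliding_window_mask(T: int, W: int) -> np.ndarray:
    """Causal sliding-window mask: token i attends only to tokens i-W+1 .. i."""
    i = np.arange(T)[:, None]   # query positions
    j = np.arange(T)[None, :]   # key positions
    return (j <= i) & (j > i - W)

# Each row has at most W True entries, so per-token attention cost
# is O(W) instead of O(T).
print(sliding_window_mask(8, 3).astype(int))
```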

Positional

  • RoPE with a high base frequency (1M+): supports longer contexts directly.
  • YaRN, NTK-aware scaling, LongRoPE: extend RoPE-trained models post-hoc.
  • Position interpolation: smooth scaling of positional encodings (see the sketch below).
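
As a sketch of position interpolation (the dimensions and lengths here are illustrative, not any particular model's): RoPE angles are computed on positions rescaled into the original training range, so a model trained at 8k can be run at 32k after light fine-tuning. YaRN and NTK-aware scaling instead adjust the frequencies non-uniformly.

```python
import numpy as np

def rope_angles(positions, dim, base=10_000.0, scale=1.0):
    """RoPE rotation angles; scale < 1 implements position interpolation."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)            # per-pair frequency
    return np.outer(np.asarray(positions) * scale, inv_freq)    # (T, dim/2) angles

# Interpolated angles at position 32_000 (scale 8k/32k = 0.25)
# match the vanilla angles at position 8_000.
assert np.allclose(rope_angles([32_000], 64, scale=0.25),
                   rope_angles([8_000], 64))
```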

Training

  • Curriculum: start with short contexts, lengthen progressively.
  • Continued pretraining at long context on a small, high-quality dataset of long documents.
  • Long-context RL: tune for retrieval and reasoning at long contexts.

Inference

  • FlashAttention v2/v3: tile-based attention, no full T×T matrix in memory.
  • Paged KV cache: allocate KV memory in pages, like virtual memory.
  • Prefix caching: reuse computed KV for shared prompt prefixes (sketched below).
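
A toy sketch of the prefix-caching pattern (the prefill_fn callable and the KV objects are placeholders, not any serving engine's real API): key the cached KV state on a hash of the shared prefix so repeat queries only pay prefill cost for the suffix.

```python
import hashlib
from typing import Any, Callable

kv_cache: dict[str, Any] = {}   # prefix hash -> KV state computed for that prefix

def prefill_with_prefix_cache(prefix: str, suffix: str,
                              prefill_fn: Callable[[str, Any], Any]) -> Any:
    """Pay full prefill for a shared prefix once; later queries reuse it."""
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in kv_cache:
        kv_cache[key] = prefill_fn(prefix, None)     # expensive, computed once
    return prefill_fn(suffix, kv_cache[key])         # cheap: continues from cached state
```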

The “needle in a haystack” benchmark

A classic long-context test: hide a short piece of information (“The best place to eat in San Francisco is Tartine”) in a giant block of unrelated text, then ask a question about it.

Results vary by:

  • Position (where in the haystack the needle lives).
  • Topic (relevant vs. unrelated to surrounding text).
  • Distractors (how similar the surrounding noise is).

Frontier models (Claude, Gemini, GPT-4.x) score nearly perfectly on simple haystacks at 200k+ tokens. Multi-needle and reasoning over long context are harder — many models degrade past 32k–128k tokens even when haystack tests look perfect.
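
A sketch of how such a test is typically built (the filler text, needle, question, and ask_model call below are placeholders to swap for your own data and client):

```python
FILLER = "The sky was grey and the meeting ran long. " * 50   # unrelated padding
NEEDLE = "The best place to eat in San Francisco is Tartine."
QUESTION = "What is the best place to eat in San Francisco?"

def make_haystack(total_chars: int, depth: float) -> str:
    """Haystack of ~total_chars with the needle inserted at relative depth (0..1)."""
    body = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    cut = int(total_chars * depth)
    return body[:cut] + "\n" + NEEDLE + "\n" + body[cut:]

def run_eval(ask_model, lengths=(50_000, 200_000), depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid over context length x needle position; grade by substring match."""
    for n in lengths:
        for d in depths:
            answer = ask_model(make_haystack(n, d) + "\n\n" + QUESTION)
            print(n, d, "PASS" if "Tartine" in answer else "FAIL")
```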

Long-context evaluation

Beyond the haystack test:

  • RULER (NVIDIA 2024): synthetic long-context tasks at varied lengths.
  • InfiniteBench, ZeroSCROLLS, LongBench: real-world tasks (summarization, multi-doc QA).
  • NoLiMa: needle-in-haystack with semantic (not exact-match) needles — much harder.
  • OpenAI MRCR, Anthropic’s NIAH variants: vendor-specific evals.

For your own application, build a long-context eval that mirrors your data.

When to use long context vs RAG

Rules of thumb:

| Scenario | Long context | RAG | Both |
|---|---|---|---|
| Corpus < 100k tokens | yes | | |
| Corpus < 1M tokens | possible | | |
| Corpus > 1M tokens | impractical | yes | |
| Need source citations | hard | easy | use RAG, even with an LC model |
| Latency-critical | no | yes | |
| Cost-sensitive | no | yes | |
| Information across docs needs connecting | yes | needs care | RAG → reorder → LC |

The modern hybrid: retrieve a tighter set of context (e.g. top 50 chunks → ~50k tokens), then let the LLM reason over the full retrieved set. Better than top-5-only RAG, much cheaper than dumping the whole corpus.
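
A sketch of that hybrid (the retriever, rerank, and ask_model callables and the chunk fields text/n_tokens are assumed interfaces, not a specific library): retrieve wide, trim to a token budget, then reorder so the top-ranked chunks sit at the start and end of the prompt, where models attend best.

```python
def hybrid_answer(question: str, retriever, rerank, ask_model,
                  k: int = 50, budget_tokens: int = 50_000) -> str:
    """Retrieve wide, trim to a token budget, then reason over the full set."""
    chunks = rerank(question, retriever(question, k=k))   # best-first order
    kept, used = [], 0
    for c in chunks:                                       # stay under the budget
        if used + c.n_tokens > budget_tokens:
            break
        kept.append(c)
        used += c.n_tokens
    # Counter lost-in-the-middle: interleave so the highest-ranked chunks
    # end up at the edges of the prompt (rank 1 first, rank 2 last, ...).
    ordered = kept[::2] + kept[1::2][::-1]
    context = "\n\n".join(c.text for c in ordered)
    return ask_model(f"{context}\n\nQuestion: {question}")
```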

Long-context cost model

For a 1M-token prompt:

  • ~$5–$15 per query at frontier model prices (input tokens).
  • ~5–60 second processing time.
  • Reasoning models on top of long context: even more.

Caching helps:

  • Prompt caching (Anthropic, OpenAI, Gemini): cache the static portion of a long prompt; cached input tokens are heavily discounted (up to ~10× cheaper, depending on provider). If you query a 200k-token document repeatedly, only the new part is full-priced.

This makes “load a book and ask 50 questions” practical and cheap.
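
Back-of-the-envelope arithmetic for that case (the price and discount below are placeholders; substitute your provider's current rates):

```python
def corpus_qa_cost(doc_tokens: int, n_questions: int,
                   price_per_mtok: float = 5.0,     # $/1M input tokens (placeholder)
                   cache_discount: float = 0.1,     # cached reads at ~10% of full price
                   question_tokens: int = 500) -> tuple[float, float]:
    """Input-token cost of n questions over one document, without vs with caching."""
    uncached = n_questions * (doc_tokens + question_tokens) / 1e6 * price_per_mtok
    cached = (doc_tokens / 1e6 * price_per_mtok                              # one full prefill
              + (n_questions - 1) * doc_tokens / 1e6 * price_per_mtok * cache_discount
              + n_questions * question_tokens / 1e6 * price_per_mtok)        # the questions
    return uncached, cached

# 200k-token book, 50 questions: roughly $50 without caching vs ~$6 with it.
print(corpus_qa_cost(200_000, 50))
```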

Practical patterns

  1. Always cache when reusing context. Saves cost dramatically.
  2. Place critical info early or late in the prompt. Models attend better to those positions.
  3. Use RAG to filter before long context — relevance > capacity.
  4. Test your specific use case; synthetic benchmarks don’t always predict performance on your own data.
  5. Measure latency under your real distribution.
  6. Avoid mid-context distractors — they confuse models more than positional limits.

What’s next

  • 10M+ token contexts with hybrid attention/SSM models (already research-grade).
  • Continuous online context — agents that maintain context across days.
  • Better long-context training data — synthetic generation pipelines for long-doc reasoning.
  • Architectural breakthroughs — Mamba-2, JEPA-style alternatives, retrieval-transformer hybrids.

See also