Advanced Retrieval Patterns

Beyond the basic embed-and-retrieve pipeline, several techniques handle harder query patterns: ambiguous queries, multi-hop reasoning, abstract questions, and knowledge spread across multiple documents.

Query rewriting

Reformulate the user query to be retrieval-friendly.

rewritten = llm(f"""
The user said: "{query}"

Rewrite as a search query optimized for retrieval. Be concise.
""")
results = retrieve(rewritten)

Useful for:

  • Conversational queries (“what about pricing?” → “ExampleCorp pricing”)
  • Vague questions (“tell me about it” with conversation history)
  • Adding domain keywords

Cost: one extra LLM call.

Query decomposition

Break a complex query into independent sub-queries.

sub_queries = llm(f"""
Decompose this question into independent retrieval queries, one per line:
"{query}"
""").splitlines()
all_results = []
for sq in sub_queries:
    all_results.extend(retrieve(sq))
return dedupe(all_results)

Used for:

  • Multi-part questions (“compare X and Y on Z”)
  • Multi-hop questions (“when was the company started by the CEO of X founded?”)
  • Aggregation questions (“list all cases where …”)

Step-back queries

Ask the model for a more general formulation, then retrieve on both:

Original: "What's the boiling point of water at 0.5 atm?"
Step-back: "How does pressure affect boiling point?"

Retrieve with both; combine results. Handy for fact-grounded reasoning.
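A minimal sketch, reusing the llm, retrieve, and dedupe helpers assumed in the snippets above:

step_back = llm(f'Rewrite this as a more general question: "{query}"')
results = dedupe(retrieve(query) + retrieve(step_back))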

HyDE (Hypothetical Document Embeddings)

Gao et al. (2023). Instead of embedding the query, generate a hypothetical answer and embed that.

hypothetical = llm("Write a passage that would answer: " + query)
results = retrieve(embed(hypothetical))  # search with the answer's embedding, not the query's

Why it works: the embedding of a “real” answer is closer in vector space to actual answer documents than the embedding of a question is.

Pros:

  • Handles short, keyword-poor queries.
  • Often improves recall measurably.

Cons:

  • Two LLM calls (slower).
  • The hypothetical can be wrong, biasing retrieval.

FLARE (Forward-Looking Active Retrieval)

Jiang et al. (2023). Retrieve mid-generation when the model’s confidence drops.

  1. Generate a sentence.
  2. Check token probabilities: are any low-confidence?
  3. If yes: pause, retrieve based on the sentence so far, prepend the results, regenerate.

Used for long-form generation where different sections need different retrieval. Complex to implement; powerful for grounded long outputs.
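
A rough sketch of the loop. The generate_sentence helper (returning the next sentence plus per-token probabilities), finished, render, and the threshold value are all assumptions for illustration; retrieve is the helper from earlier:

THRESHOLD = 0.4  # illustrative; tune against your model's calibration

answer = ""
while not finished(answer):
    sentence, token_probs = generate_sentence(prompt + answer)
    if min(token_probs) < THRESHOLD:
        # Low confidence: use the draft sentence itself as the retrieval
        # query, prepend the evidence, and regenerate the sentence.
        evidence = retrieve(sentence)
        sentence, token_probs = generate_sentence(render(evidence) + prompt + answer)
    answer += sentence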

Self-RAG

Asai et al. (2023). The model is fine-tuned to decide whether to retrieve, what to retrieve, and how much to trust each retrieved passage:

[Retrieve?] → No / Yes → [Retrieve query] → [Score relevance] → [Generate]

Each step is a special token the model emits. Strong on benchmarks; requires a fine-tuned model. Some open implementations (self-rag series) exist.
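
A rough sketch of the control flow, not the real token vocabulary: model stands for the fine-tuned checkpoint, and the [Retrieve] / [Relevant] strings and grade_prompt / answer_prompt helpers are illustrative stand-ins for the paper's reflection tokens:

decision = model(query)                      # the model may emit a retrieve token
if "[Retrieve]" not in decision:
    return model(answer_prompt(query, passages=[]))

passages = retrieve(query)
# The model grades each passage itself via a reflection token.
relevant = [p for p in passages if "[Relevant]" in model(grade_prompt(query, p))]
return model(answer_prompt(query, relevant))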

Corrective RAG (CRAG)

Yan et al. (2024). After retrieval, evaluate whether the retrieved passages are good. If not, retrieve from a fallback (e.g. web search):

  1. Retrieve top-k.
  2. Evaluate relevance (cheap classifier or LLM).
  3. If low: trigger web search or alternate retrieval.
  4. Filter / refine.
  5. Generate.

Good for systems with mixed-quality knowledge bases.
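
A sketch of those steps, where grade, web_search, answer_prompt, and both thresholds are illustrative assumptions:

KEEP, CONFIDENT = 0.5, 0.7   # illustrative thresholds

passages = retrieve(query)
grades = [grade(query, p) for p in passages]        # cheap relevance score per passage

kept = [p for p, g in zip(passages, grades) if g >= KEEP]
if not kept or max(grades, default=0.0) < CONFIDENT:
    kept += web_search(query)                       # fall back when the KB looks weak

return llm(answer_prompt(query, kept))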

Agentic RAG

An agent (Stage 11) orchestrates retrieval over multiple tools and iterates:

Agent loop:
  - Decide what to look up
  - Call retrieval tool
  - Reason about result
  - Decide next: retrieve more, switch tool, answer, give up

Tools can include vector search, SQL, web search, internal APIs.

Used by ChatGPT’s browsing, Claude’s tool use, Perplexity’s pipeline. Slower and more expensive than naive RAG but handles complex queries.
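
A sketch of the loop, assuming a decide call that asks an LLM to pick the next action, plus hypothetical run_sql and web_search tools:

TOOLS = {"vector": retrieve, "sql": run_sql, "web": web_search}
MAX_STEPS = 8   # cap iterations so the agent can't loop forever

history = []
for _ in range(MAX_STEPS):
    action = decide(query, history)   # LLM returns {"tool": ..., "input": ...} or {"answer": ...}
    if "answer" in action:
        return action["answer"]
    result = TOOLS[action["tool"]](action["input"])
    history.append((action, result))  # the agent reasons over this next turn
return "No confident answer within the step budget."

The step cap and the explicit give-up path matter in practice: without them an agent can burn tokens retrying the same failing lookup.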

GraphRAG

Microsoft’s pattern: build a knowledge graph from your corpus first; retrieve via graph traversal + community summaries.

Index time:
  - Extract entities and relationships from docs (LLM-driven)
  - Build a graph
  - Cluster the graph; summarize each community

Query time:
  - Map query to relevant communities
  - Retrieve community summaries + relevant chunks
  - Generate answer

Strong for global queries (“what are the major themes in this corpus?”) that defeat naive chunk retrieval. Expensive to index; cheap at query time once built.
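
A sketch of both phases; every helper here (build_graph, detect_communities, rank_by_similarity, relevant_chunks, and the prompts) is a hypothetical stand-in for what the real pipeline does:

# Index time (LLM-driven, expensive):
graph = build_graph(extract_entities_and_relations(corpus))
communities = detect_communities(graph)            # e.g. Leiden clustering
summaries = {c: llm(summarize_prompt(c)) for c in communities}

# Query time (cheap once the index exists):
top = rank_by_similarity(query, summaries)[:5]     # map query to communities
context = [summaries[c] for c in top] + relevant_chunks(query, top)
answer = llm(answer_prompt(query, context))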

RAPTOR

Sarthi et al. (2024). Recursively cluster chunks and summarize each cluster. Build a hierarchy of summaries above the raw chunks:

Raw chunks
  ↓ cluster + summarize
Cluster summaries
  ↓ cluster + summarize
Higher-level summaries
  ↓ ...

At query time, retrieve from any level. High-level for abstract queries; raw chunks for specific ones.
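
A sketch of the index build, with cluster_by_embedding, summarize_prompt, and retrieve_from as hypothetical helpers:

levels = [raw_chunks]
while len(levels[-1]) > 1:
    clusters = cluster_by_embedding(levels[-1])        # e.g. soft clustering over embeddings
    levels.append([llm(summarize_prompt(c)) for c in clusters])

# Query time: index every node in the tree, so abstract queries can match
# high-level summaries while specific queries match raw chunks.
tree = [node for level in levels for node in level]
results = retrieve_from(tree, query)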

Multi-vector retrieval

Several vectors per chunk, each capturing a different “facet”:

  • Title embedding
  • Content embedding
  • Question-style embedding (LLM-generated)
  • Summary embedding

Search all simultaneously; combine. Sometimes called “multi-representation” retrieval.

A simple version: index both the chunk and an LLM-generated question for that chunk; retrieve against both.
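
A sketch of the fuller four-facet version, treating index.add / index.search as a generic stand-in for your vector store's API:

for chunk in chunks:
    question = llm(f"Write a question this passage answers: {chunk.text}")
    summary = llm(f"Summarize in one sentence: {chunk.text}")
    for facet in (chunk.title, chunk.text, question, summary):
        index.add(vector=embed(facet), payload={"chunk_id": chunk.id})

hits = index.search(embed(query), k=20)              # one search covers all facets
results = dedupe(h.payload["chunk_id"] for h in hits)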

Late interaction (ColBERT)

We introduced this in embedding-models-for-retrieval.md. One vector per token rather than one per chunk; each query token is matched against every document token, taking the max similarity per query token and summing the results (MaxSim).

Higher quality than single-vector retrieval, but more storage and compute. Often used as a re-ranking step over top-100.
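
The scoring function itself is small. A sketch in NumPy, assuming both sides' token embeddings are already computed and L2-normalized:

import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    # query_vecs: (q_tokens, dim); doc_vecs: (d_tokens, dim); normalized rows.
    sims = query_vecs @ doc_vecs.T       # cosine similarity of every token pair
    return float(sims.max(axis=1).sum())  # best doc token per query token, summed

# Re-rank the top-100 from a cheaper first-stage retriever:
reranked = sorted(top100, key=lambda d: maxsim(q_vecs, d.token_vecs), reverse=True)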

Time-aware retrieval

For knowledge that drifts:

  • Index documents with created_at and valid_until.
  • Boost more recent docs at retrieval time (a sketch follows this list).
  • Filter explicitly when freshness matters.
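
A minimal recency boost, assuming each hit carries a similarity score and a created_at Unix timestamp; the half-life is a tunable guess:

import time

HALF_LIFE_DAYS = 90   # illustrative: a doc's boost halves every 90 days

def recency_boost(score, created_at):
    age_days = (time.time() - created_at) / 86400
    return score * 0.5 ** (age_days / HALF_LIFE_DAYS)

hits = retrieve(query)
hits.sort(key=lambda h: recency_boost(h.score, h.created_at), reverse=True)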

Summary of when to use each

Pattern                       When
Query rewriting               Conversational queries, vague queries
Query decomposition           Multi-part, multi-hop questions
HyDE                          Short queries, semantic queries
FLARE                         Long-form grounded generation
Self-RAG                      When you can fine-tune; high-stakes grounding
Corrective RAG                Mixed-quality knowledge bases
Agentic RAG                   Complex queries needing iteration
GraphRAG                      Abstract / corpus-level queries
RAPTOR                        Hierarchical knowledge, mixed query granularity
Late interaction (ColBERT)    High-precision rerank
Multi-vector                  When chunks have multiple “angles”

Don’t reach for these too early

A naive RAG with good embeddings, hybrid search, and a re-ranker covers most cases. Add complexity when:

  1. You’ve measured a specific failure mode.
  2. You have an eval set to verify the new technique helps.
  3. The complexity is justified by the gain.

Premature complexity is the #1 RAG anti-pattern. Each new layer is another thing to debug, monitor, and pay for.

See also