Stage 09 — Retrieval-Augmented Generation (RAG)

LLMs have a finite context window and can’t be retrained for every new piece of information. RAG is the engineering pattern that lets them work with arbitrary up-to-date knowledge: retrieve relevant passages, stuff them in the prompt, generate.

It sounds simple. Production RAG is one of the subtlest systems in modern AI to get right.
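
The whole loop fits in a page of code. Here is a minimal sketch using only the standard library — the bag-of-words "embedding" and the in-memory "index" are toy stand-ins for a real embedding model and vector database, and the final LLM call is deliberately left as a stub:

    import math
    from collections import Counter

    def embed(text: str) -> Counter:
        # Toy stand-in for an embedding model: bag-of-words term counts.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Toy stand-in for a vector database: a list of (chunk, vector) pairs.
    chunks = [
        "RAG retrieves relevant passages before generating an answer.",
        "Embeddings map text to vectors so similar texts land close together.",
    ]
    index = [(c, embed(c)) for c in chunks]

    def retrieve(query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
        return [c for c, _ in ranked[:k]]

    def rag(query: str) -> str:
        context = "\n".join(retrieve(query))
        # In a real system this prompt goes to an LLM; here we just return it.
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    print(rag("What does RAG retrieve?"))

Every topic in the ladder below hardens one of these steps: how you cut the chunks, what embeds them, where they live, how you retrieve, and how you check the output.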

Prerequisites

  • Stage 05 (embeddings)
  • Stage 08 (prompting, structured outputs)

Learning ladder

  1. RAG fundamentals — the loop, when to use it, when not to
  2. Chunking strategies — fixed, semantic, structural, late chunking (a fixed-size sketch follows this list)
  3. Embedding models for retrieval — choosing, evaluating, tuning
  4. Vector databases — pgvector, Qdrant, LanceDB, Pinecone — pick one
  5. Hybrid search & reranking — BM25 + dense fusion, cross-encoders (see the RRF sketch below)
  6. Advanced retrieval patterns — HyDE, FLARE, query decomposition, GraphRAG
  7. Evaluating RAG — recall@k, faithfulness, golden sets, Ragas (a recall@k sketch closes this list)
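
To make item 2 concrete, here is a minimal sketch of the simplest strategy: fixed-size character windows with overlap. The 500/100 values are illustrative placeholders, not recommendations — real choices depend on your tokenizer, content, and embedding model:

    def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
        # Fixed-size character windows; the overlap keeps a sentence that
        # straddles a boundary intact in at least one chunk.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    parts = chunk("some long document " * 200)
    print(len(parts), len(parts[0]))  # 10 chunks, first one 500 chars

Semantic and structural chunking swap the fixed window for sentence, heading, or layout boundaries; late chunking embeds the whole document first and pools per-chunk vectors afterwards.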
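
For item 5, a common way to fuse a BM25 ranking with a dense ranking is Reciprocal Rank Fusion (RRF), which combines ranks instead of raw scores, so the two retrievers need no score calibration. A sketch with hypothetical document ids:

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc.
        # k = 60 is the commonly used constant from the original RRF paper.
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits  = ["d3", "d1", "d7"]   # hypothetical keyword-index results
    dense_hits = ["d1", "d9", "d3"]   # hypothetical vector-index results
    print(rrf([bm25_hits, dense_hits]))  # docs on both lists rise to the top

A cross-encoder reranker is then typically run over the fused top-k, rescoring each (query, chunk) pair jointly; that second stage is usually where most of the precision gain comes from.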
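
And for item 7, the first metric worth standing up is usually recall@k against a hand-labeled golden set. A sketch — the query and chunk ids are made up:

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        # Fraction of the known-relevant chunks that appear in the top k.
        return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

    # A golden set maps each query to the chunk ids a correct answer needs.
    golden = {"when was the product launched?": {"notes-2021#4"}}
    retrieved = ["notes-2021#4", "faq#2", "blog#9"]
    print(recall_at_k(retrieved, golden["when was the product launched?"], k=3))  # 1.0

Recall@k isolates the retriever; faithfulness (does the generated answer stay within the retrieved context?) needs an LLM judge or a framework like Ragas layered on top.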

MVU

You can:

  • Build a RAG end-to-end (chunk → embed → store → retrieve → generate)
  • List 5 ways your RAG can silently fail in production
  • Pick chunking and retrieval parameters with a defensible reason
  • Evaluate RAG quality with retrieval and end-to-end metrics

Exercise

Build a RAG over 1000 of your own notes/documents. Then break it:

  • Ambiguous queries
  • Multi-hop questions
  • Questions whose answer spans multiple documents
  • Questions where the right answer is “I don’t know”

For each failure mode, fix it and document what you changed.

Why this stage matters

RAG is the most common LLM application pattern in industry. Most AI startups have RAG inside them somewhere. Mediocre RAG is everywhere; great RAG is rare and worth a lot.

Hands-on companions

After the theory here, three concrete next stops:

Ship a production RAG stack:

See it as a real product:

See also