Stage 09 — Retrieval-Augmented Generation (RAG)

LLMs have a finite context window and can’t be retrained for every new piece of information. RAG is the engineering pattern that lets them work with arbitrary up-to-date knowledge: retrieve relevant passages, stuff them in the prompt, generate.

It sounds simple. Production RAG is one of the subtlest systems in modern AI to get right.
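
The whole loop fits in a page of code. Here is a minimal sketch using only the standard library — the bag-of-words "embedding" and the in-memory "index" are toy stand-ins for a real embedding model and vector database, and the final LLM call is deliberately left as a stub:

    import math
    from collections import Counter

    def embed(text: str) -> Counter:
        # Toy stand-in for an embedding model: bag-of-words term counts.
        return Counter(text.lower().split())

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # Toy stand-in for a vector database: a list of (chunk, vector) pairs.
    chunks = [
        "RAG retrieves relevant passages before generating an answer.",
        "Embeddings map text to vectors so similar texts land close together.",
    ]
    index = [(c, embed(c)) for c in chunks]

    def retrieve(query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
        return [c for c, _ in ranked[:k]]

    def rag(query: str) -> str:
        context = "\n".join(retrieve(query))
        # In a real system this prompt goes to an LLM; here we just return it.
        return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

    print(rag("What does RAG retrieve?"))

Every topic in the ladder below hardens one of these steps: how you cut the chunks, what embeds them, where they live, how you retrieve, and how you check the output.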

Prerequisites

  • Stage 05 (embeddings)
  • Stage 08 (prompting, structured outputs)

Learning ladder

  1. RAG fundamentals — the loop, when to use it, when not to
  2. Chunking strategies — fixed, semantic, structural, late chunking (a fixed-size sketch follows this list)
  3. Embedding models for retrieval — choosing, evaluating, tuning
  4. Vector databases — pgvector, Qdrant, LanceDB, Pinecone — pick one
  5. Hybrid search & reranking — BM25 + dense fusion, cross-encoders (see the RRF sketch below)
  6. Advanced retrieval patterns — HyDE, FLARE, query decomposition, GraphRAG
  7. Evaluating RAG — recall@k, faithfulness, golden sets, Ragas (a recall@k sketch closes this list)
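
To make item 2 concrete, here is a minimal sketch of the simplest strategy: fixed-size character windows with overlap. The 500/100 values are illustrative placeholders, not recommendations — real choices depend on your tokenizer, content, and embedding model:

    def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
        # Fixed-size character windows; the overlap keeps a sentence that
        # straddles a boundary intact in at least one chunk.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    parts = chunk("some long document " * 200)
    print(len(parts), len(parts[0]))  # 10 chunks, first one 500 chars

Semantic and structural chunking swap the fixed window for sentence, heading, or layout boundaries; late chunking embeds the whole document first and pools per-chunk vectors afterwards.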
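
For item 5, a common way to fuse a BM25 ranking with a dense ranking is Reciprocal Rank Fusion (RRF), which combines ranks instead of raw scores, so the two retrievers need no score calibration. A sketch with hypothetical document ids:

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc.
        # k = 60 is the commonly used constant from the original RRF paper.
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    bm25_hits  = ["d3", "d1", "d7"]   # hypothetical keyword-index results
    dense_hits = ["d1", "d9", "d3"]   # hypothetical vector-index results
    print(rrf([bm25_hits, dense_hits]))  # docs on both lists rise to the top

A cross-encoder reranker is then typically run over the fused top-k, rescoring each (query, chunk) pair jointly; that second stage is usually where most of the precision gain comes from.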
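
And for item 7, the first metric worth standing up is usually recall@k against a hand-labeled golden set. A sketch — the query and chunk ids are made up:

    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        # Fraction of the known-relevant chunks that appear in the top k.
        return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

    # A golden set maps each query to the chunk ids a correct answer needs.
    golden = {"when was the product launched?": {"notes-2021#4"}}
    retrieved = ["notes-2021#4", "faq#2", "blog#9"]
    print(recall_at_k(retrieved, golden["when was the product launched?"], k=3))  # 1.0

Recall@k isolates the retriever; faithfulness (does the generated answer stay within the retrieved context?) needs an LLM judge or a framework like Ragas layered on top.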

MVU

You can:

  • Build a RAG end-to-end (chunk → embed → store → retrieve → generate)
  • List 5 ways your RAG can silently fail in production
  • Pick chunking and retrieval parameters with a defensible reason
  • Evaluate RAG quality with retrieval and end-to-end metrics

Exercise

Build a RAG over 1000 of your own notes/documents. Then break it:

  • Ambiguous queries
  • Multi-hop questions
  • Questions whose answer spans multiple documents
  • Questions where the right answer is “I don’t know”

For each failure mode, fix it and document what you changed.

Why this stage matters

RAG is the most common LLM application pattern in industry. Most AI startups have RAG inside them somewhere. Mediocre RAG is everywhere; great RAG is rare and worth a lot.

Hands-on companions

After the theory here, three concrete next stops:

Ship a production RAG stack:

See it as a real product:

See also