RAG Fundamentals

Retrieval-Augmented Generation: pull relevant text from a knowledge source, give it to an LLM, generate. That’s it. That’s also where 100 production failure modes hide.

The basic loop

  1. Query
  2. Embed (or transform) the query
  3. Retrieve top-k chunks from a knowledge store
  4. Construct the prompt: instructions + retrieved chunks + query
  5. The LLM generates an answer
  6. (Optionally) cite sources, validate, filter

Every component can be improved independently. Every component can fail independently.

Why RAG exists

LLMs:

  • Have a fixed knowledge cutoff.
  • Hallucinate when they don’t know something.
  • Can’t be retrained quickly for every new piece of information.
  • Have finite context (though long context is closing this gap — Stage 07).

RAG:

  • Brings up-to-date information at runtime.
  • Grounds answers in retrievable sources (auditability).
  • Lets you handle knowledge bases too large for any context window.
  • Can incorporate proprietary data without fine-tuning.

When NOT to use RAG

  • Knowledge fits in context (< 100k tokens): just put it in the prompt.
  • Static, stable knowledge: fine-tuning may be more efficient.
  • Reasoning-only tasks: no facts to retrieve. RAG adds noise.
  • Highly creative tasks: RAG biases toward retrieved examples.
  • Real-time retrieval is too slow: latency-critical paths.

The minimal RAG, in code

import openai
import chromadb

client = openai.OpenAI()
db = chromadb.PersistentClient("./db").get_or_create_collection("notes")

def embed(text):
    return client.embeddings.create(
        input=text, model="text-embedding-3-small"
    ).data[0].embedding

def index(docs):
    for i, doc in enumerate(docs):
        db.add(ids=[str(i)], embeddings=[embed(doc)], documents=[doc])

def ask(query):
    results = db.query(query_embeddings=[embed(query)], n_results=3)
    context = "\n\n---\n\n".join(results["documents"][0])
    prompt = f"""Answer the question using the context. If unsure, say "I don't know."

<context>
{context}
</context>

Question: {query}"""
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

That’s a working RAG in ~25 lines. Now we’ll see why it fails in real use.
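
A quick smoke test, assuming OPENAI_API_KEY is set in the environment (the two documents are invented for illustration):

index([
    "The API rate limit is 100 requests per minute per key.",
    "Support hours are 09:00-17:00 CET, Monday to Friday.",
])
print(ask("What is our rate limit?"))  # should answer from the first note only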

The five hardest things in RAG

1. Chunking

How do you slice your documents? Fixed token windows? Sentences? Sections? Get it wrong and:

  • A relevant fact is split across two chunks → retrieval misses it.
  • A chunk is too long → LLM gets distracted.
  • Chunks lose context (e.g. headings) → retrieval finds the right place but the LLM can’t use it.

Stage chunking-strategies.md goes deep.
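
As a reference point, here is the crudest baseline in sketch form: fixed token windows with overlap (tiktoken for counting; the 300/50 values are arbitrary starting points, not recommendations):

import tiktoken

def chunk_fixed(text, max_tokens=300, overlap=50):
    # Slide a fixed-size token window over the text; overlap is the blunt
    # fix for facts that would otherwise be split across a chunk boundary.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    for start in range(0, len(tokens), max_tokens - overlap):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return chunks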

2. Embedding quality

A bad embedding model means relevant chunks don’t land near relevant queries in vector space. The embedding model is often the single highest-leverage choice in the pipeline, and domain-specific models often beat generic SOTA models.
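
One cheap sanity check before committing to a model: embed a few hand-written (query, relevant chunk, distractor) triples and confirm the relevant chunk wins on cosine similarity. A sketch reusing embed() from above (the example strings are invented):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

query = "how do I rotate an API key?"
relevant = "API keys can be rotated from the account settings page."
distractor = "Office keys are handed out by the facilities team."
print(cosine(embed(query), embed(relevant)))    # should be clearly higher
print(cosine(embed(query), embed(distractor)))  # than this one

If the margin is thin across many such triples, try a different or domain-tuned embedding model.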

3. Query transformation

User queries are often:

  • Ambiguous (“what about pricing?”)
  • Dependent on unstated context (“did the bug get fixed?”)
  • Phrased as full questions when bare keywords would retrieve better
  • Multi-part (“compare X and Y on Z”)

Naive retrieval on the raw query fails on all of these. Solutions: HyDE, query rewriting, decomposition (see advanced-retrieval-patterns.md).
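
A sketch of the simplest fix, LLM-based query rewriting; the prompt wording and model choice are assumptions, and decomposition follows the same shape with “split into sub-questions” instead:

def rewrite_query(query, history=""):
    # Turn a vague, context-dependent question into a standalone search query
    # before embedding it.
    prompt = (
        "Rewrite the user's question as a standalone search query. "
        "Resolve pronouns and fill in missing context from the conversation.\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Question: {query}\n\nSearch query:"
    )
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()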

4. Generation grounding

Even with perfect retrieval, the LLM may:

  • Ignore the retrieved context.
  • Hallucinate beyond what’s in the context.
  • Mix retrieved facts with prior knowledge confidently.

Mitigations:

  • Strict prompting (“only use the provided context”).
  • Source citation requirements.
  • Faithfulness evaluation (Stage evaluating-rag.md).
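
A sketch combining the first two mitigations; the exact wording is an assumption, and no prompt makes grounding guaranteed:

def grounded_prompt(query, chunks):
    # Number the chunks so every claim can be cited as [1], [2], ...
    sources = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using ONLY the sources below, and cite each claim like [1]. "
        "If the sources do not contain the answer, say \"I don't know.\"\n\n"
        f"{sources}\n\nQuestion: {query}"
    )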

5. Evaluation

How do you know your RAG is good? Eyeballing 5 examples isn’t an eval. Stage evaluating-rag.md is dedicated to this.
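
A sketch of the retrieval half, recall@k against a hand-labeled set; the eval format (query plus the id of one known-relevant chunk) is an assumption about how you label:

def recall_at_k(eval_set, k=10):
    # eval_set: list of (query, relevant_chunk_id) pairs labeled by hand.
    hits = 0
    for query, relevant_id in eval_set:
        results = db.query(query_embeddings=[embed(query)], n_results=k)
        hits += relevant_id in results["ids"][0]
    return hits / len(eval_set)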

Common failure modes

  • “I don’t know” when the answer is right there. Retrieval failed, or the chunk was malformed.
  • Confident wrong answers. The model hallucinated despite retrieval, or retrieval surfaced the wrong document.
  • Generic answers ignoring context. The model defaulted to its prior.
  • Cite-but-don’t-use. Model lists sources but answer doesn’t reflect them.
  • Multi-hop failure. Question requires combining two facts; RAG retrieves one well and one badly.
  • Stale data. The index hasn’t been updated; the model is “right” according to its training data, but the indexed documents say otherwise.
  • Identity confusion. In multi-tenant RAG, data leaks across tenants if retrieval doesn’t filter by tenant.
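
The last one has a mechanical fix when the store supports metadata filters. With Chroma, for example (the tenant_id field and values are assumptions about your schema):

doc = "ACME's contract renews on March 1."  # invented example
db.add(ids=["acme-0"], embeddings=[embed(doc)], documents=[doc],
       metadatas=[{"tenant_id": "acme"}])

results = db.query(
    query_embeddings=[embed("when does our contract renew?")],
    n_results=3,
    where={"tenant_id": "acme"},  # hard-scope retrieval to the caller's tenant
)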

RAG vs long context vs fine-tuning

Decision factor                           | RAG       | Long context | Fine-tuning
Frequently changing knowledge             | best fit  | possible     | bad fit
Source citations needed                   | best fit  | hard         | possible
Fits in 100k tokens                       | ok        | best fit     | possible
Needs structured output / behavior change | sometimes | sometimes    | best fit
Latency-critical                          | ok        | slower       | fast
Proprietary, doesn’t change               | ok        | possible     | best fit

Often the answer is RAG + long context — retrieve a wider net, give the model more to work with.

Architecture diagrams

The simplest:

User → [Query] → [Embedder] → [Vector DB] → top-k chunks → [LLM] → Answer

A production-grade variant:

User → [Query rewriter / decomposer] → multiple sub-queries
        ↓
[Hybrid search: dense + sparse]
        ↓
[Reranker (cross-encoder)]
        ↓
[Filter / dedupe / metadata constraints]
        ↓
[Build prompt: instructions + chunks + citation rules]
        ↓
[LLM]
        ↓
[Faithfulness check / citation parse] → Answer with sources

Each box is something to optimize. Each is also a source of latency and cost.
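
To make one of those boxes concrete, a reranker sketch with a small cross-encoder (the sentence-transformers model name is just a common default, not a recommendation):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_n=5):
    # Score each (query, chunk) pair jointly, then keep the highest-scoring chunks.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]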

Variants worth knowing

  • RAG: classic — retrieve, prompt, generate.
  • Self-RAG: model decides whether/when to retrieve mid-generation.
  • Corrective RAG: retrieve, evaluate, retrieve again if needed.
  • Agentic RAG: an agent loop that retrieves, reasons, retrieves more.
  • GraphRAG: build a knowledge graph from sources, retrieve via graph traversal.
  • HyDE: generate a hypothetical answer, embed it, retrieve based on that.
  • FLARE: retrieve mid-generation when the model’s confidence drops.

We unpack these in advanced-retrieval-patterns.md.
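
Most of these are one extra LLM call or a loop around the basic pipeline. HyDE in sketch form, reusing the pieces from the minimal RAG above (the prompt wording is an assumption):

def hyde_retrieve(query, n_results=3):
    # Embed a hypothetical answer instead of the question: the fake answer
    # often lands closer to real answer passages in embedding space.
    hypothetical = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that plausibly answers: {query}"}],
    ).choices[0].message.content
    return db.query(query_embeddings=[embed(hypothetical)], n_results=n_results)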

Practical advice

  1. Start with a baseline. A naive RAG with text-embedding-3-small + top-5 retrieval + a clean prompt is often surprisingly competitive.
  2. Build an eval set early. 50–100 representative queries with expected sources or answers. Without this, you’re flying blind.
  3. Measure retrieval, then end-to-end. Recall@10 first; then faithfulness, helpfulness.
  4. Add complexity only with evidence. Re-ranking, query decomposition, hybrid search — each should improve a measured metric.
  5. Cache. Embeddings, prompts, sometimes generations (see the sketch after this list).
  6. Monitor in production. Log queries with retrievals; review weekly.
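
The caching sketch promised in item 5: memoizing the embedding call in-process (a real deployment would persist this, e.g. in Redis or the vector store itself):

from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_cached(text: str):
    # Repeated texts (popular queries, re-indexed docs) skip the embeddings API call.
    return embed(text)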

Watch it interactively

  • RAG Visualizer — real sentence-transformer embeddings + real BM25 + real RRF fusion on a small corpus. Toggle dense vs BM25 vs hybrid and see the rank shuffle. Predict before clicking: dense wins on paraphrases (“vegetarian options” matches “meatless food”); BM25 wins on rare keywords (“DATABASE_URL” exact-match). Hybrid catches both.
  • Chunking Strategies — paste your own text, switch between recursive/semantic/markdown chunkers, watch the cuts.
  • Reranker Lab — bi-encoder vs cross-encoder rerank on real precomputed scores. Shows the rank shuffle that makes top-1 actually correct.
  • Embedding Playground — pre-computed sentence-transformer vectors + live cosine similarity.

Build it in code

See also