Evaluating RAG
You can’t improve what you don’t measure. Most RAG systems in production have no real evaluation — engineers eyeball a few examples and ship. This is the single biggest gap in the field.
Two layers of eval
RAG has two stages, evaluated separately:
1. Retrieval evaluation
Did the retriever surface relevant documents?
Metrics:
- Recall@k: of the truly relevant docs, what fraction appear in the top-k?
- Precision@k: of the top-k retrieved, what fraction are relevant?
- MRR (Mean Reciprocal Rank): 1 / position of first relevant doc, averaged.
- nDCG (Normalized Discounted Cumulative Gain): relevance-weighted, position-discounted.
For RAG, recall@k is usually the primary metric. If the relevant doc isn’t in your top-k, the LLM can’t possibly answer correctly. Aim for ≥80% recall@10 before optimizing anything else.
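A minimal sketch of recall@k and MRR over a labeled set; eval_set, case["relevant_ids"], and retriever are placeholders for whatever your harness provides, not a specific library:

def recall_at_k(relevant_ids, retrieved_ids, k=10):
    # fraction of the truly relevant docs that show up in the top-k
    hits = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)

def mrr(relevant_ids, retrieved_ids):
    # 1 / rank of the first relevant doc; 0 if nothing relevant was retrieved
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

recalls = [recall_at_k(case["relevant_ids"], retriever(case["query"])) for case in eval_set]
print("recall@10:", sum(recalls) / len(recalls))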
2. End-to-end (generation) evaluation
Given the retrieved docs, did the LLM produce a good answer?
Metrics:
- Faithfulness / groundedness: does the answer follow from the retrieved context?
- Answer relevance: does the answer address the query?
- Helpfulness: would a user find this useful?
- Correctness (when ground truth exists): exact match, F1, semantic similarity.
Building a golden set
You need labeled examples. The single biggest investment in RAG quality is building a golden eval set.
A good golden set has:
- 50–500 representative queries drawn from real user behavior (or your best simulation of it).
- For each: the expected source documents (for retrieval eval).
- For each: an expected answer or accept criteria (for end-to-end eval).
- Difficulty distribution: easy/medium/hard.
- Type distribution: factual, multi-hop, ambiguous, “I don’t know” cases.
Build it iteratively. Start with 30 examples; expand as you discover failure modes.
Don’t use a live sample of production traffic as your golden set: production drifts, and your eval becomes a moving target. Snapshot it.
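One possible shape for a golden-set entry (field names and values here are illustrative, not a standard schema):

golden_case = {
    "query": "What is the refund window for annual plans?",
    "expected_doc_ids": ["billing-policy#chunk-12"],  # for retrieval eval
    "expected_answer": "30 days from purchase",       # or acceptance criteria
    "difficulty": "easy",                             # easy / medium / hard
    "type": "factual",                                # factual / multi-hop / ambiguous / idk
}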
Synthetic eval set generation
For scale, generate eval queries from your corpus:
eval_set = []
for chunk in corpus.sample(100):
    # ask the LLM to write a question that this chunk answers
    question = llm(f"Generate a question this chunk would answer:\n{chunk}")
    eval_set.append({"query": question, "expected_doc_id": chunk.id})
Pros:
- Scalable.
- Forces you to think about retrieval explicitly.
Cons:
- Synthetic queries don’t always match real user phrasing.
- Bias toward what the LLM generates.
Best practice: combine synthetic queries for scale with real-user-derived queries to keep the distribution realistic.
Faithfulness evaluation
The LLM-as-judge approach:
Given the context, the question, and the answer, was the answer fully supported by the context?
Context: ...
Question: ...
Answer: ...
Respond: yes / partially / no, with one sentence of justification.
Use a judge model at least as strong as your generator when accuracy matters; a cheaper judge model cuts eval cost, at some loss of reliability.
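A sketch of that judge call, assuming a generic llm(prompt) helper rather than any particular SDK:

def judge_faithfulness(context, question, answer):
    prompt = (
        "Given the context, the question, and the answer, was the answer "
        "fully supported by the context?\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Respond: yes / partially / no, with one sentence of justification."
    )
    verdict = llm(prompt)  # placeholder; swap in your judge model's API
    return verdict.strip().lower().startswith("yes"), verdict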
Open frameworks:
- RAGAs — popular RAG eval library; faithfulness, answer relevance, context relevance.
- DeepEval — LLM-evals framework.
- promptfoo — config-driven eval, easy CI integration.
- Phoenix / Arize — production tracing + eval.
- TruLens — RAG-specific eval and feedback.
Answer correctness
When you have a reference answer:
- Exact match: too brittle for free-form answers.
- F1 over tokens: classic QA metric; partial credit (see the sketch after this list).
- Semantic similarity (BERTScore, embedding cosine): catches paraphrases.
- LLM-as-judge with a rubric: most flexible, most reliable for nuanced answers.
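The token-level F1 mentioned above, sketched with whitespace tokenization for brevity (real QA metrics also normalize punctuation and articles):

def token_f1(prediction, reference):
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # count overlapping tokens, respecting multiplicity
    common = sum(min(pred.count(tok), ref.count(tok)) for tok in set(pred))
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)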
For LLM-as-judge:
Rubric:
- Correct: the answer states the same fact as the reference.
- Partial: some accurate parts, some missing or wrong.
- Wrong: contradicts the reference or makes up facts.
Reference: ...
Answer: ...
Verdict: ...
Reasoning: ...
Retrieval-only ablations
Test retrieval in isolation to find the bottleneck:
- Generation with gold context: does the LLM answer correctly when handed the known-relevant documents? If not, fix the prompt or the model, not the retriever.
- Gold vs. retrieved context: compare answers given the gold context against answers given the retrieved context. If the former are good and the latter are not, retrieval is the bottleneck.
This tells you where to spend effort.
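A sketch of that comparison, with generate, retriever, and grade as placeholders for your own pipeline pieces:

def ablate(case):
    # same generator, two different contexts: gold vs. retrieved
    gold_answer = generate(case["query"], case["gold_context"])
    rag_answer = generate(case["query"], retriever(case["query"]))
    return {
        "gold_ok": grade(gold_answer, case),      # fails => fix the prompt or model
        "retrieved_ok": grade(rag_answer, case),  # gold ok but this fails => fix retrieval
    }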
Common failure modes to test for
Build eval cases for each:
- Easy lookup: “What’s the price of plan X?”
- Multi-hop: “When was the company founded by the CEO of X founded?”
- Ambiguous: “What did they say about pricing?” (need conversation history)
- Adversarial: “Tell me about a feature that doesn’t exist.”
- Multi-document: answer in one chunk, key context in another.
- Negation: “Which products are NOT eligible for refunds?”
- Aggregation: “How many cases mention X?”
- Time-sensitive: “What’s the latest version?”
- Out of corpus: should produce “I don’t know” / refusal.
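If each golden-set case carries a type tag (as in the entry sketch above), you can slice metrics per failure mode instead of looking at one blended number; run_and_grade is a placeholder for your harness:

from collections import defaultdict

by_type = defaultdict(list)
for case in eval_set:
    by_type[case["type"]].append(run_and_grade(case))  # 1.0 if passed, 0.0 if failed

for case_type, results in sorted(by_type.items()):
    print(case_type, sum(results) / len(results))      # per-category pass rate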
Continuous eval in production
Production tracing:
- Log every query: query, retrieved docs, answer, latency, cost.
- Sample for eval: nightly, run faithfulness on a random 1% of yesterday’s traffic.
- Surface drift: compare metrics week-over-week.
- Human review queue: low-confidence or flagged answers go to a human for review.
Tools that help: Langfuse, Phoenix, Arize, LangSmith, Helicone.
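A sketch of the nightly sample-and-judge job, reusing the judge_faithfulness helper above; traces stand in for whatever your logging store returns:

import random

def nightly_faithfulness(traces, sample_rate=0.01):
    # traces: yesterday's logged (query, retrieved docs, answer) records
    sample = random.sample(traces, max(1, int(len(traces) * sample_rate)))
    verdicts = [judge_faithfulness(t["retrieved_docs"], t["query"], t["answer"])[0]
                for t in sample]
    return sum(verdicts) / len(verdicts)  # track this week-over-week for drift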
A/B testing changes
When you change anything (chunking, embedder, prompt, model, reranker):
- Run on the eval set first. If metrics drop, don’t ship.
- If metrics improve, A/B test in production with a fraction of traffic.
- Compare faithfulness, latency, cost, and user feedback before full rollout.
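A sketch of the "if metrics drop, don't ship" gate, suitable for CI; the 2-point threshold is illustrative:

def regression_gate(new_metrics, baseline_metrics, max_drop=0.02):
    # block the change if any tracked metric drops by more than max_drop
    regressions = {
        name: (baseline_metrics[name], value)
        for name, value in new_metrics.items()
        if baseline_metrics[name] - value > max_drop
    }
    if regressions:
        raise SystemExit(f"Eval regression, do not ship: {regressions}")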
What “good” looks like
Rough numbers from real systems:
- Recall@10: 80–95% (depends on corpus size and query difficulty).
- Faithfulness rate: 90%+ on factual queries.
- End-to-end correctness: 70–90% on a domain-specific eval, depending on difficulty.
- Latency p95: 1–5s for naive, 5–15s for advanced pipelines.
- Cost per query: $0.001 (cheap) to $0.10 (heavy reranking + reasoning).
If your numbers are way off, identify which stage is the bottleneck.
Pitfalls
- Eyeballing 5 examples and shipping. Common, dangerous.
- Eval set drift: refreshing the golden set every iteration so metrics always look better.
- Optimizing the wrong metric: retrieval recall is great, but if generation is hallucinating, the user doesn’t care.
- Ignoring “I don’t know” cases: a system that refuses helpful answers can score “high faithfulness” but be useless.
- No latency / cost in your eval: a 100% accurate system that costs $1 per query is broken.
Watch it interactively
- Confusion Matrix Lab — the same precision/recall framing, applied to a binary classifier. Drag the threshold; watch precision and recall trade. Same dynamics as recall@k vs precision in RAG.
- LLM-as-Judge — tune rubric weights, watch the winner change. Predict before clicking: with weight 5 on factuality and 0 on tone, the more accurate but terser answer wins; flip the weights and the friendlier-but-wronger answer wins. Same answers, different rubric, different verdict.
Build it in code
- /ship/04 — build an eval harness — TaskCase, rule-based grading, LLM-as-judge grading.
- /ship/13 — evals in production — golden-set regression on every deploy, paired A/B prompt tests, drift detection on live traffic, feedback-to-eval pipeline.
- /case-studies/01 — docs assistant — three-bucket refusal eval (out-of-scope / answerable / boundary) and the cite-correctness metric in action.