case study 05 · composes the /ship stack
The cheapest version of itself
Take the docs assistant from CS-01 and distill it into a 7B student. Same retrieval, same citation contract, ~5× cheaper, ~5pp eval gap. The product that pays for itself.
The product, again
Same shape as case-study 01:
POST /ask  { "question": "how do I use env vars in Astro?" }
→ { "answer": "...with citations [docs:env-vars/2]...",
    "citations": [{"id": "docs:env-vars/2", "url": "...", "score": 0.86}],
    "refused": false,
    "trace_id": "5a1c..." }
Same hybrid retriever (/ship/08). Same MDX-aware chunking. Same citation-first system prompt. Same three-bucket refusal eval (out-of-scope / answerable / boundary). The only thing that changes is the LLM that turns retrieved chunks + the question into the answer JSON.
We replace it with a student we trained ourselves. And we keep the teacher around for the queries the student can’t handle.
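Both sides of that swap sit behind one tiny interface, which is what keeps the rest of the stack unchanged. A minimal sketch of the contract — the Protocol below is illustrative, not the actual CS-01 type, but the `.name` / `.complete` shape matches the router and handler code later in this piece:

from typing import Protocol

class Model(Protocol):
    # Shared by student and teacher; the router and handler below only
    # ever touch these two members.
    name: str

    async def complete(self, system: str, user: str) -> str: ...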
Why bother
The cost picture for CS-01, before any optimization:
per /ask call:
  retrieval  ≈ 0.03¢   (BM25 + dense + rerank, our infra)
  LLM call   ≈ 1.20¢   (Sonnet-tier teacher, ~2k in / ~400 out)
  ─────────────────────────
  total      ≈ 1.23¢   ≈ $12.30 per 1k calls

at 100k /ask/month:  $1,230/mo   · trending up
at 1M /ask/month:   $12,300/mo   · the line item your CFO notices
The retrieval is cheap and you own it. The teacher is the bill.
What we want at the end of this case study:
per /ask call (after distillation, 80/20 routing):
  retrieval  ≈ 0.03¢
  LLM call   ≈ 0.21¢   (80% student-served · 20% teacher-served)
  ─────────────────────────
  total      ≈ 0.24¢   ≈ $2.40 per 1k calls
~5× cheaper. Same UX. Same contract.
That’s the prize. Now the work to claim it.
Architecture — what changes vs CS-01
POST /ask
         │
         ▼
┌──────────────────┐
│ HybridRetriever  │ ◀── unchanged from CS-01
│  (/ship/06–08)   │
└────────┬─────────┘
         │ retrieved chunks + question
         ▼
┌──────────────────┐
│ Router           │ ◀── new (this case study)
│  - long context? │
│  - low retrieval │
│    confidence?   │
│  - escalation?   │
└────────┬─────────┘
         │
     ┌───┴──────────────┐
     ▼                  ▼
┌─────────┐       ┌──────────┐
│ Student │       │ Teacher  │
│ (LoRA   │       │ (frontier│
│  on 7B) │       │  API)    │
└────┬────┘       └─────┬────┘
     │                  │
     └────────┬─────────┘
              ▼
     ┌──────────────────┐
     │ Output validator │ ◀── unchanged
     │ - cite-coverage  │
     │ - schema OK      │
     │ - refusal sane   │
     └──────────────────┘
Three new things vs CS-01:
- The synthetic-data pipeline that turns ~200 hand-written seed questions into ~12k labeled training examples (using the teacher we already pay for in production).
- The distillation training loop that fine-tunes a 7B base with LoRA on the labeled data, using the soft-KL + hard-CE recipe from /ship/17.
- The router that decides per-call whether to send the prompt to the student or the teacher — three hard-coded escape hatches, no learned classifier.
Everything else is reused from CS-01.
The hard parts
1. Generating training data without contaminating the eval
The first pitfall every team hits. The eval set you measure against in CS-01 is also the most natural source of “good” training questions. Mix them, and the parity number you report at the end is meaningless.
The discipline:
# stack/distill_data.py
SEED_DIR  = "data/seed_prompts.jsonl"    # ~200 hand-written
EVAL_DIR  = "data/eval/golden_v3.jsonl"  # ~150 cases — frozen
TRAIN_DIR = "data/distill_train.jsonl"   # generated, ~12k

def build_training_set(seeds: list[Question], teacher: TeacherAPI):
    # 1. Cluster seeds → pick medoids → expand each via paraphrase
    paraphrased = expand_via_paraphrase(seeds, teacher, T=0.9, n_per=8)

    # 2. Reject anything within 0.92 cosine of any eval question
    eval_qs = load_jsonl(EVAL_DIR)
    eval_emb = embed([q.text for q in eval_qs])
    paraphrased = [
        p for p in paraphrased
        if max(cosine(embed(p.text), e) for e in eval_emb) < 0.92
    ]

    # 3. Run each through the FULL CS-01 pipeline (retrieve → LLM → validate).
    #    This is critical — the student needs to learn the actual system
    #    behavior, not just the LLM's preferences.
    labeled = [run_full_pipeline(q, teacher) for q in paraphrased]

    # 4. Quality-filter via judge (5-pt rubric, drop scores < 4)
    labeled = [x for x in labeled if judge(x).score >= 4]

    # 5. Capture top-K logprobs from teacher for soft-KL training
    for x in labeled:
        x.teacher_logprobs = teacher.last_call.top_k_logprobs

    save_jsonl(TRAIN_DIR, labeled)
The key move is step 2 — reject training examples too close to any eval question. The 0.92 cosine threshold catches not just exact duplicates but the obvious paraphrases that creep in when the teacher generates from a seed it shares with eval. Without this step, every team I’ve seen reports a ~1pp parity gap on the first try, then catches the contamination, then sees the real ~5pp gap.
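For concreteness, here is what that rejection step can look like with an off-the-shelf sentence encoder — a sketch, not the pipeline’s actual embed/cosine helpers; the model choice and the filter_near_eval name are assumptions:

# Sketch of step 2's near-duplicate filter, assuming sentence-transformers.
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any decent encoder works

def filter_near_eval(candidates: list[str], eval_questions: list[str],
                     threshold: float = 0.92) -> list[str]:
    # Unit-normalized embeddings → dot product == cosine similarity
    cand = _encoder.encode(candidates, normalize_embeddings=True)
    ev = _encoder.encode(eval_questions, normalize_embeddings=True)
    sims = cand @ ev.T                       # [n_candidates, n_eval]
    keep = sims.max(axis=1) < threshold      # reject anything too close to eval
    return [c for c, k in zip(candidates, keep) if k]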
The other key move is step 3 — label using the full pipeline, not just the bare LLM. The student needs to learn that “given these retrieved chunks, the right answer cites docs:env-vars/2 and docs:env-vars/3 in this format.” The teacher running over the bare question would generate a fundamentally different distribution.
2. The two-loss training loop
The actual training is the standard /ship/17 recipe, with one product-specific tweak:
import torch
import torch.nn.functional as F

def distill_step(student, teacher_logprobs, batch, T=2.0, alpha=0.3):
    # Forward pass on the student: logits of shape [batch, seq, vocab]
    s_logits = student(batch.input_ids)
    V = s_logits.size(-1)

    # Hard cross-entropy on the teacher's chosen tokens (the gold tokens)
    L_hard = F.cross_entropy(s_logits.reshape(-1, V), batch.target_ids.reshape(-1))

    # Soft KL on the top-K teacher distribution per token, with temperature.
    # We only have top-K teacher logprobs (not the full vocab); restrict the
    # student to the same K and renormalize both sides over those K entries.
    s_topk = F.log_softmax(torch.gather(s_logits, -1, batch.teacher_top_k_ids) / T, dim=-1)
    t_topk = F.softmax(teacher_logprobs / T, dim=-1)
    L_soft = T * T * F.kl_div(s_topk, t_topk, reduction="batchmean")

    return alpha * L_hard + (1 - alpha) * L_soft
The product-specific tweak: we double the alpha for citation tokens. Tokens that look like [docs:env-vars/2] or ] after a citation get α=0.6 instead of α=0.3 — they’re the highest-stakes tokens in the output, the ones the eval grades exactly, and we want the hard-CE pressure on them.
def per_token_alpha(input_ids, base=0.3, citation_boost=2.0):
    # Detect tokens inside [docs:...] spans via a simple span detector,
    # then boost the hard-CE weight on exactly those tokens.
    is_in_citation = mark_citation_tokens(input_ids)
    return torch.where(is_in_citation, base * citation_boost, base)
Without this tweak, the student’s free-form prose is great but its citation IDs occasionally drift (it’ll cite docs:env-vars/3 when the right one is /2). With it, citation accuracy jumps from ~91% → ~96% — the kind of move that pays for the whole pipeline.
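The span detector is the only non-obvious piece of the α-boost. One way to implement mark_citation_tokens, assuming a fast tokenizer that returns character offsets — this variant works from the decoded text plus offsets rather than raw input_ids, and the regex and overlap test are mine:

import re
import torch

CITATION_RE = re.compile(r"\[docs:[^\]]+\]")

def mark_citation_tokens(text: str, offsets: list[tuple[int, int]]) -> torch.Tensor:
    # offsets[i] = (char_start, char_end) of token i, e.g. from
    # tokenizer(text, return_offsets_mapping=True)["offset_mapping"]
    spans = [m.span() for m in CITATION_RE.finditer(text)]
    flags = [
        any(start < s_end and end > s_start for s_start, s_end in spans)
        for start, end in offsets
    ]
    return torch.tensor(flags)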
3. The router — turning “mostly equal” into “actually shippable”
After ~6 hours of LoRA training on a single A100, the student matches the teacher within ~5pp on the eval set. That’s good enough for ~80% of traffic but not all of it. A blanket cutover ships a quality regression.
So we route. The router is intentionally simple:
def route(question: str, retrieved: list[Chunk]) -> Model:
    # Hard escape #1: long context. Student is a 7B with an 8k window;
    # if the prompt + chunks blow past 6k tokens, send to the teacher.
    if estimate_tokens(question, retrieved) > 6000:
        return teacher

    # Hard escape #2: low retrieval confidence. The hardest queries
    # for the student are the ones where the retriever is uncertain;
    # those are usually long-tail / out-of-scope and benefit from the
    # teacher's better refusal calibration.
    if max(c.score for c in retrieved) < 0.55:
        return teacher

    # Hard escape #3: explicit `prefer_quality=true` override (debug,
    # admin tool, customer-success queries) — carried here as the
    # "[admin]" question prefix.
    if question.startswith("[admin]"):
        return teacher

    return student
That’s it. No ML for the router — three rules, hand-tuned on a held-out slice. About 20% of production queries hit one of those branches; the other 80% go to the student.
The 80/20 split isn’t a coincidence — it’s the same split /ship/17 recommends as the typical sweet spot. A more aggressive 90/10 saves more money but the eval gap widens to ~7pp and the long-tail user complaints start. A more cautious 60/40 keeps the gap under 2pp but the cost win shrinks to ~3×. 80/20 is the production sweet spot.
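Before shipping a threshold change, replay a held-out slice through the router and check the split. A hypothetical helper, reusing the route and retriever defined above:

from collections import Counter

async def routed_split(questions: list[str]) -> Counter:
    # Count who would serve each query at the current thresholds.
    counts: Counter = Counter()
    for q in questions:
        chunks = await retriever.retrieve(q, k=8)
        counts[route(q, chunks).name] += 1
    return counts

# e.g. Counter({"student": 801, "teacher": 199}) on a 1k-query slice → ~80/20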
The full handler
# server/ask_v2.py
@app.post("/ask")
async def ask(req: AskRequest):
    trace_id = uuid.uuid4().hex
    with trace.start_as_current_span("ask", attributes={"trace_id": trace_id}):
        # 1. Retrieval (unchanged from CS-01)
        chunks = await retriever.retrieve(req.question, k=8)

        # 2. Route
        model = route(req.question, chunks)
        with trace.start_as_current_span("llm", attributes={"model": model.name}):
            raw = await model.complete(
                system=DOCS_SYSTEM_PROMPT,
                user=format_question_with_chunks(req.question, chunks),
            )

        # 3. Validate (unchanged from CS-01)
        validated = validate_response(raw, chunks)
        if not validated.ok:
            # Fallback: re-route to teacher if student emitted invalid output
            with trace.start_as_current_span("retry-on-teacher"):
                model = teacher  # so served_by reflects who actually answered
                raw = await teacher.complete(
                    system=DOCS_SYSTEM_PROMPT,
                    user=format_question_with_chunks(req.question, chunks),
                )
                validated = validate_response(raw, chunks)

        return AskResponse(
            answer=validated.answer,
            citations=validated.citations,
            refused=validated.refused,
            trace_id=trace_id,
            served_by=model.name,  # ← new field, used in online eval
        )
The served_by field is small but important — it lets us slice production metrics by model in the observability dashboard, so we can tell when the student starts drifting on a class of queries the eval didn’t catch.
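What that slice can look like, assuming the traces land somewhere queryable — the export path and column names here are hypothetical:

import pandas as pd

# Weekly validator pass-rates per serving model. Drift on the student-served
# slice shows up here before the frozen offline eval ever sees it.
traces = pd.read_parquet("traces/ask_v2.parquet")  # hypothetical export
weekly = (
    traces
    .assign(timestamp=pd.to_datetime(traces["timestamp"]))
    .set_index("timestamp")
    .groupby("served_by")[["cite_coverage", "refusal_ok"]]
    .resample("W")
    .mean()
)
print(weekly.tail())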
The eval results
Frozen eval set: 150 cases, three buckets (90 answerable in-scope / 30 boundary / 30 explicitly out-of-scope).
                          teacher    student   80/20-routed
────────────────────────────────────────────────────────────
exact-match accuracy       76.7%      71.3%       75.3%
cite-coverage              94.1%      90.8%       93.7%
cite-correctness           96.3%      95.5%       96.1%
refusal precision          93.3%      90.0%       92.7%
refusal recall             90.0%      83.3%       88.7%
p50 latency (ms)            1240        390         720
p95 latency (ms)            3100        710        1860
$/1k calls                $12.30      $1.45       $2.40
mean rubric score (1-5)     4.31       4.07        4.27
Three things to read out of this table:
- The student alone is not a drop-in. A 5pp drop on exact-match and a 7pp drop on refusal recall are both visible regressions. Refusal recall is the worst — the student over-answers boundary queries ~7pp more often than the teacher, which is exactly the failure mode we cannot tolerate.
- The router closes most of the gap. 80/20 routing brings exact-match within 1.4pp, cite-coverage within 0.4pp, refusal recall within 1.3pp, and rubric score within 0.04. All of those are below the noise floor of a 150-case eval (95% CI ≈ ±7pp at this size). Statistically, 80/20-routed is indistinguishable from the teacher at this sample size.
- The cost-and-latency wins are not noise. The router hits a 5× cost reduction and a ~40% p50 latency reduction (1240 ms → 720 ms). Those numbers are stable across evals because they’re driven by which model gets called, not what the model says.
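That noise-floor figure is a one-liner to sanity-check — a normal-approximation 95% binomial CI at this eval size:

from math import sqrt

def ci95(p: float, n: int = 150) -> float:
    # Normal-approximation 95% confidence interval for a proportion
    return 1.96 * sqrt(p * (1 - p) / n)

print(f"±{ci95(0.767):.1%}")  # ≈ ±6.8pp around the 76.7% exact-match score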
The 80/20-routed product is the one you ship.
The ROI calculation
Up-front costs (one-time):
  synthetic data generation       ~$120   (12k examples × ~1k tokens × $10/Mtok)
  LoRA training (A100 × 6h)        ~$10   (cloud rental)
  eng time (build + tune router)  ~3 days (this case study, end to end)
Recurring win, at 100k /ask/month:
  before: 100,000 × $12.30/k = $1,230/mo
  after:  100,000 ×  $2.40/k =   $240/mo
  saved:  $990/mo
  break-even, all-in (infra + eng time): ~5 weeks
At 1M /ask/month:
  before: $12,300/mo
  after:   $2,400/mo
  saved:   $9,900/mo
  break-even, all-in: <2 weeks
The recurring win scales linearly with traffic, which is the whole point of distillation — the up-front cost is fixed, the saving compounds. Every month after the break-even point is essentially free margin.
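The break-even arithmetic generalizes to any traffic level. A sketch under the prices above — the eng-day rate is an explicit assumption, not a number from this case study:

def breakeven_months(upfront_usd: float, calls_per_month: int,
                     before_per_k: float = 12.30, after_per_k: float = 2.40) -> float:
    # Months until the recurring saving pays back the one-time cost
    saved_per_month = calls_per_month / 1000 * (before_per_k - after_per_k)
    return upfront_usd / saved_per_month

infra = 120 + 10              # data generation + LoRA training
eng = 3 * 400                 # ASSUMPTION: ~$400/eng-day, not from the source
print(breakeven_months(infra + eng, 100_000))    # ≈ 1.3 months (~5–6 weeks)
print(breakeven_months(infra + eng, 1_000_000))  # ≈ 0.13 months (<1 week)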
This is also why it’s worth doing after you have a working CS-01, not before. You need the eval, the retrieval stack, the validation, and the production traces to feed the synthetic-data pipeline. Distillation is a v2 move.
What we’d change in v2
- Retrain on production traces, not just synthetic data. Three months of served_by=teacher traces are a goldmine — they’re real queries the router already decided the student couldn’t handle. Train the next student on those and the router thresholds can ratchet tighter.
- Two students, one router. A 1B “fastest” student for clear in-scope queries (~90% of traffic), a 7B “midweight” student for ambiguous-but-handleable queries (~10%), and the teacher for the rest. Three-tier routing typically saves another ~30%.
- Calibration on the student. The student’s confidence is poorly calibrated post-distillation (it inherits the teacher’s overconfidence and amplifies it). A temperature-scaling pass on the validation set, à la /demos/calibration, would tighten the refusal precision/recall numbers further.
Try this — predict the eval delta
Hands-on experiments to run after you’ve built it:
- Drop α to 0.0 (pure soft KL, no hard CE). Predict: the student fluently produces citation-shaped text but gets the actual [docs:env-vars/2] IDs wrong more often. Measure the cite-correctness drop. (Expected: ~10pp. The hard-CE on citation tokens is the single most important loss component for this product.)
- Drop the citation α-boost. Train at uniform α=0.3 across all tokens. Predict: prose stays great, citation accuracy drops ~5pp. Run the eval — the magnitude is what tells you whether per-token loss weighting is worth it on your product.
- Tighten the router to 90/10. Drop the retrieval-confidence threshold from 0.55 to 0.40. Predict: cost drops ~25%, refusal recall drops ~3pp (long-tail queries the student over-answers). Measure rubric score on the boundary bucket — that’s where you’ll see the regression first.
- Loosen the router to 60/40. Predict: cost climbs from $2.40/k to ~$5.50/k, eval gap shrinks to ≤1pp. The question this answers: how much are you willing to pay for the last percentage point of parity?
- Contaminate the training set on purpose. Skip the cosine-rejection step in build_training_set. Predict: training-set accuracy approaches 99% (looks great in the loss curve), eval-set accuracy drops vs the clean run because the student overfit on near-duplicates. The lesson: train/eval contamination inflates training quality — exactly the wrong-direction tell.
- Try a different student base (Mistral-7B → Qwen2-7B → Llama-3-8B). Predict: the rank order on your specific task is unpredictable, but the gap between best and worst is usually 2–4pp. The exercise: build the parity table from /ship/17 for your corpus.
Each of these is a one-flag-flip experiment. Together they’re the calibration intuition that turns “I followed the recipe” into “I know which knob to turn next.”
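Wired into a config object, each run really is a one-field change. A hypothetical shape — the names are mine, not from the case-study repo:

from dataclasses import dataclass

@dataclass
class DistillConfig:
    alpha: float = 0.3                  # 0.0 → pure soft KL, no hard CE
    citation_boost: float = 2.0         # 1.0 → no per-token α-boost
    retrieval_threshold: float = 0.55   # 0.40 → ~90/10 split; raise it → 60/40
    reject_near_eval: bool = True       # False → deliberate contamination
    student_base: str = "mistral-7b"    # or "qwen2-7b", "llama-3-8b"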
Cross-references
Anchored to:
- /ship/17 — synthetic data + distillation — the full recipe this case study applies to a specific product.
- /ship/04 — eval harness — the TaskCase and graders we reuse to score teacher / student / routed.
- /ship/13 — eval in production — the A/B harness and online drift monitoring that watches the student-served slice in prod.
- /ship/14 — cost & latency — the routing pattern and prompt_cache tricks that compose with distillation.
Demos to run alongside:
- Distillation Lab — interactive: real GPT-2 teacher logits, learnable student, KL gradient. Set α=0.3, T=2.0 to match this case study’s recipe; watch the student’s distribution converge over ~50 steps.
- Calibration Lab — the v2 follow-up. Slide T from 1.0 → 2.0 on a held-out validation set to tune the student’s confidence post-distillation.
- LLM-as-Judge — the rubric we use for the quality-filter step in build_training_set and for the production rubric score in the eval table.
- Cost & Latency Calculator — predict the savings before you build. Compare “frontier all the way” vs “80/20 routed to a 7B student” on the input/output token sliders.
Adjacent articles:
- Stage 10 — Distillation — the theory companion: two-loss formulation, dark knowledge, what transfers and what doesn’t.
- Stage 10 — LoRA & QLoRA — the parameter-efficient training that makes the 6-hour A100 run feasible.
- Stage 13 — Cost & latency — distillation in the broader context of cost levers (prompt caching, batching, quantization).
- Stage 13 — Evaluation & benchmarks — the discipline behind the parity table above.
What this case study taught vs /ship
/ship/17 gives you the recipe. CS-05 gives you four product-specific moves the recipe can’t:
- Cosine-rejection of training examples against the eval set. Generic recipes assume your eval is held out elsewhere; in practice the same hand-curation that builds the eval also builds the seed prompts, and contamination is the default failure mode.
- Per-token loss weighting. When your output has high-stakes substructure (citation IDs, JSON keys, code identifiers), boosting α on those tokens is the single highest-ROI tweak. The recipe doesn’t tell you this because it’s product-specific by definition.
- A three-rule router. The literature talks about learned routing classifiers; in production a hand-tuned escape-hatch list outperforms them on day one and gives you something you can debug. Add learning later if the rules stop holding.
- Validator-on-fallback. When the student emits invalid JSON or a missing citation, automatically re-route the query to the teacher rather than returning an error. Costs ~3% extra teacher calls; saves ~100% of the user-visible failures.
Wrapping the case-studies arc
Five case studies, four products, one stack:
- CS-01 — docs assistant. RAG + citation contract. The most-built LLM product.
- CS-02 — code-review agent. Tool-use + action-rate metric. The most-shipped agent.
- CS-03 — research assistant. Multi-agent orchestration + cost decomposition. The hardest to deploy.
- CS-04 — customer-support bot. RAG + tools + escalation. The product that composes the entire stack.
- CS-05 — the cheapest version of itself. Distillation + routing on top of CS-01. The product that pays for itself.
The arc: pick a product, ship it on a frontier model, measure it honestly, then squeeze the cost out of it without giving up the contract. CS-01 → CS-05 is the full lifecycle.
Every product ends here. You ship on the frontier. You measure. You distill. You route. Repeat.