case-studies 05/05 · builds on /ship/04, 13, 14, 17 · 30 min read · 2h hands-on

case study 05 · composes the /ship stack

The cheapest version of itself

Take the docs assistant from CS-01 and distill it into a 7B student. Same retrieval, same citation contract, ~5× cheaper, ~5pp raw eval gap that routing mostly closes. The product that pays for itself.

distillation · cost · production · fine-tuning · synthetic-data

The product, again

Same shape as case-study 01:

POST /ask  { "question": "how do I use env vars in Astro?" }

→ { "answer": "...with citations [docs:env-vars/2]...",
    "citations": [{"id": "docs:env-vars/2", "url": "...", "score": 0.86}],
    "refused": false,
    "trace_id": "5a1c..." }

Same hybrid retriever (/ship/08). Same MDX-aware chunking. Same citation-first system prompt. Same three-bucket refusal eval (out-of-scope / answerable / boundary). The only thing that changes is the LLM that turns retrieved chunks + the question into the answer JSON.

We replace it with a student we trained ourselves. And we keep the teacher around for the queries the student can’t handle.

Why bother

The cost picture for CS-01, before any optimization:

       per /ask call:
         retrieval     ≈ 0.03¢   (BM25 + dense + rerank, our infra)
         LLM call      ≈ 1.20¢   (Sonnet-tier teacher, ~2k in / ~400 out)
       ─────────────────────────
         total         ≈ 1.23¢   ≈ $12.30 per 1k calls

       at 100k /ask/month: $1,230/mo · trending up
       at 1M  /ask/month:  $12,300/mo · the line item your CFO notices

The retrieval is cheap and you own it. The teacher is the bill.

What we want at the end of this case study:

       per /ask call (after distillation, 80/20 routing):
         retrieval     ≈ 0.03¢
         LLM call      ≈ 0.21¢   (80% student-served · 20% teacher-served)
       ─────────────────────────
         total         ≈ 0.24¢   ≈ $2.40 per 1k calls

       ~5× cheaper. Same UX. Same contract.

That’s the prize. Now the work to claim it.

Architecture — what changes vs CS-01

                  POST /ask


            ┌──────────────────┐
            │ HybridRetriever  │   ◀── unchanged from CS-01
            │  (/ship/06–08)   │
            └────────┬─────────┘
                     │  retrieved chunks + question

            ┌──────────────────┐
            │ Router           │   ◀── new (this case study)
            │  - long context? │
            │  - low retrieval │
            │    confidence?   │
            │  - escalation?   │
            └────────┬─────────┘

        ┌────────────┴────────────┐
        ▼                         ▼
   ┌─────────┐              ┌──────────┐
   │ Student │              │ Teacher  │
   │ (LoRA   │              │ (frontier│
   │  on 7B) │              │  API)    │
   └────┬────┘              └─────┬────┘
        │                         │
        └────────────┬────────────┘

            ┌──────────────────┐
            │ Output validator │   ◀── unchanged
            │  - cite-coverage │
            │  - schema OK     │
            │  - refusal sane  │
            └──────────────────┘

Three new things vs CS-01:

  1. The synthetic-data pipeline that turns ~200 hand-written seed questions into ~12k labeled training examples (using the teacher we already pay for in production).
  2. The distillation training loop that fine-tunes a 7B base with LoRA on the labeled data, using the soft-KL + hard-CE recipe from /ship/17.
  3. The router that decides per-call whether to send the prompt to the student or the teacher — a small classifier plus three hard-coded escape hatches.

Everything else is reused from CS-01.

The hard parts

1. Generating training data without contaminating the eval

The first pitfall every team hits. The eval set you measure against in CS-01 is also the most natural source of “good” training questions. Mix them, and the parity number you report at the end is meaningless.

The discipline:

# stack/distill_data.py
SEED_DIR    = "data/seed_prompts.jsonl"        # ~200 hand-written
EVAL_DIR    = "data/eval/golden_v3.jsonl"      # ~150 cases — frozen
TRAIN_DIR   = "data/distill_train.jsonl"       # generated, ~12k

def build_training_set(seeds: list[Question], teacher: TeacherAPI):
    # 1. Cluster seeds → pick medoids → expand each via paraphrase
    paraphrased = expand_via_paraphrase(seeds, teacher, T=0.9, n_per=8)

    # 2. Reject anything within 0.92 cosine of any eval question
    eval_qs = load_jsonl(EVAL_DIR)
    eval_emb = embed([q.text for q in eval_qs])
    para_emb = embed([p.text for p in paraphrased])  # embed each candidate once
    paraphrased = [
        p for p, v in zip(paraphrased, para_emb)
        if max(cosine(v, e) for e in eval_emb) < 0.92
    ]

    # 3. Run each through the FULL CS-01 pipeline (retrieve → LLM → validate)
    #    This is critical — the student needs to learn the actual system
    #    behavior, not just the LLM's preferences.
    # 5. Capture top-K logprobs from the teacher for soft-KL training. This
    #    has to happen inside the labeling loop: `teacher.last_call` only
    #    holds the most recent call, so reading it afterwards would give
    #    every example the logprobs of the final call.
    labeled = []
    for q in paraphrased:
        x = run_full_pipeline(q, teacher)
        x.teacher_logprobs = teacher.last_call.top_k_logprobs
        labeled.append(x)

    # 4. Quality-filter via judge (5-pt rubric, drop scores < 4)
    labeled = [x for x in labeled if judge(x).score >= 4]

    save_jsonl(TRAIN_DIR, labeled)

The key move is step 2 — reject training examples too close to any eval question. The 0.92 cosine threshold catches not just exact duplicates but the obvious paraphrases that creep in when the teacher generates from a seed it shares with eval. Without this step every team I’ve seen reports a 1% parity gap on first try, then catches the contamination, then sees the real ~5% gap.
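The rejection step is worth seeing in isolation. A dependency-free sketch with toy 2-d vectors standing in for real embeddings (`reject_near_eval` is a hypothetical helper, not part of the pipeline above):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def reject_near_eval(candidates, eval_embs, threshold=0.92):
    # Keep only candidates whose max similarity to ANY eval question
    # falls below the threshold.
    return [
        (text, emb) for text, emb in candidates
        if max(cosine(emb, e) for e in eval_embs) < threshold
    ]

eval_embs = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("near-duplicate of an eval question", [0.99, 0.05]),  # cos ≈ 0.999
    ("genuinely new question", [0.6, -0.6]),               # cos ≈ 0.707
]
kept = reject_near_eval(candidates, eval_embs)
# only "genuinely new question" survives
```

Real embeddings are high-dimensional, but the thresholding logic is identical.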

The other key move is step 3 — label using the full pipeline, not just the bare LLM. The student needs to learn that “given these retrieved chunks, the right answer cites doc:env-vars/2 and doc:env-vars/3 in this format.” The teacher running over the bare question would generate a fundamentally different distribution.

2. The two-loss training loop

The actual training is the standard /ship/17 recipe, with one product-specific tweak:

def distill_step(student, teacher_logprobs, batch, T=2.0, alpha=0.3):
    # Forward pass on the student
    s_logits = student(batch.input_ids)

    # Hard cross-entropy on the teacher's chosen tokens (the gold tokens)
    L_hard = F.cross_entropy(s_logits.reshape(-1, V), batch.target_ids.reshape(-1))

    # Soft KL on the top-K teacher distribution per token, with temperature.
    # We only have top-K teacher logprobs (not the full vocab), so restrict
    # both sides to those K ids and renormalize over K. Temperature divides
    # *before* the softmax; the T² factor restores the gradient scale.
    s_topk = log_softmax(gather(s_logits, batch.teacher_top_k_ids) / T, dim=-1)
    t_topk = softmax(teacher_logprobs / T, dim=-1)
    L_soft = T * T * F.kl_div(s_topk, t_topk, reduction="batchmean")

    return alpha * L_hard + (1 - alpha) * L_soft
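What the temperature actually buys is easy to see numerically. A pure-Python sketch on a toy 3-token vocab (no gradients, just the distributions):

```python
from math import exp

def softmax(logits, T=1.0):
    # Dividing logits by T > 1 flattens the distribution.
    exps = [exp(x / T) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

teacher_logits = [4.0, 1.0, 0.0]

p_sharp = softmax(teacher_logits, T=1.0)  # ≈ [0.936, 0.047, 0.017]
p_soft  = softmax(teacher_logits, T=2.0)  # ≈ [0.736, 0.164, 0.100]
```

At T=1 the runner-up tokens carry almost no mass, so the KL term mostly re-teaches the argmax; at T=2 roughly a quarter of the mass sits on them, which is the "dark knowledge" the soft loss transfers. The T² factor in `distill_step` compensates for the 1/T shrinkage of the resulting gradients.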

The product-specific tweak: we double the alpha for citation tokens. Tokens that look like [docs:env-vars/2] or ] after a citation get α=0.6 instead of α=0.3 — they’re the highest-stakes tokens in the output, the ones the eval grades exactly, and we want the hard-CE pressure on them.

def per_token_alpha(input_ids, base=0.3, citation_boost=2.0):
    # Detect tokens inside [docs:...] spans via a simple span detector.
    is_in_citation = mark_citation_tokens(input_ids)
    return where(is_in_citation, base * citation_boost, base)

Without this tweak, the student’s free-form prose is great but its citation IDs occasionally drift (it’ll cite docs:env-vars/3 when the right one is /2). With it, citation accuracy jumps from ~91% → ~96% — the kind of move that pays for the whole pipeline.

3. The router — turning “mostly equal” into “actually shippable”

After ~6 hours of LoRA training on a single A100, the student matches the teacher within ~5pp on the eval set. That’s good enough for ~80% of traffic but not all of it. A blanket cutover ships a quality regression.

So we route. The router is intentionally simple:

def route(question: str, retrieved: list[Chunk]) -> Model:
    # Hard escape #1: long context. Student is 7B with 8k window;
    # if the prompt+chunks blow past 6k tokens, send to teacher.
    if estimate_tokens(question, retrieved) > 6000:
        return teacher

    # Hard escape #2: low retrieval confidence. The hardest queries
    # for the student are the ones where the retriever is uncertain;
    # those are usually long-tail / out-of-scope and benefit from the
    # teacher's better refusal calibration.
    if max(c.score for c in retrieved) < 0.55:
        return teacher

    # Hard escape #3: explicit quality override (debug, admin tools,
    # customer-success queries), signaled here by an "[admin]" prefix.
    if question.startswith("[admin]"):
        return teacher

    return student

That’s it. No ML for the router — three rules, hand-tuned on a held-out slice. About 20% of production queries hit one of those branches; the other 80% go to the student.
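Escape #1 only needs a rough count, not a real tokenizer. A sketch of `estimate_tokens` using the common chars/4 heuristic (an assumption; swap in the student's actual tokenizer for production):

```python
class Chunk:
    # Minimal stand-in for the retrieved-chunk type used above.
    def __init__(self, text: str):
        self.text = text

def estimate_tokens(question: str, retrieved: list[Chunk]) -> int:
    # ~4 characters per token is a serviceable heuristic for English prose;
    # it errs low on code-heavy chunks, so keep the 6k cutoff conservative.
    text = question + "".join(c.text for c in retrieved)
    return len(text) // 4
```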

The 80/20 split isn’t a coincidence — it’s the same split /ship/17 recommends as the typical sweet spot. A more aggressive 90/10 saves more money but the eval gap widens to ~7pp and the long-tail user complaints start. A more cautious 60/40 keeps the gap under 2pp but the cost win shrinks to ~3×. 80/20 is the production sweet spot.
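Sweeping the split yourself is a one-liner. The prices below are hypothetical round numbers, not this product's bill; plug in your own:

```python
def blended_cost_per_1k(teacher_frac: float,
                        teacher_per_1k: float,
                        student_per_1k: float) -> float:
    # Cost of 1k LLM calls when `teacher_frac` of traffic escalates.
    return teacher_frac * teacher_per_1k + (1 - teacher_frac) * student_per_1k

for frac in (0.1, 0.2, 0.4):
    cost = blended_cost_per_1k(frac, teacher_per_1k=10.0, student_per_1k=1.0)
    print(f"{int(frac * 100):>2}% teacher → ${cost:.2f}/1k")
```

The curve is linear in the teacher fraction, so the interesting question is never the cost; it is where the eval gap bends.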

The full handler

# server/ask_v2.py
@app.post("/ask")
async def ask(req: AskRequest):
    trace_id = uuid.uuid4().hex
    with trace.start_as_current_span("ask", attributes={"trace_id": trace_id}):

        # 1. Retrieval (unchanged from CS-01)
        chunks = await retriever.retrieve(req.question, k=8)

        # 2. Route
        model = route(req.question, chunks)
        with trace.start_as_current_span("llm", attributes={"model": model.name}):
            raw = await model.complete(
                system=DOCS_SYSTEM_PROMPT,
                user=format_question_with_chunks(req.question, chunks),
            )

        # 3. Validate (unchanged from CS-01)
        validated = validate_response(raw, chunks)
        if not validated.ok:
            # Fallback: re-route to teacher if student emitted invalid output.
            # Reassign `model` so served_by reports who actually answered.
            with trace.start_as_current_span("retry-on-teacher"):
                model = teacher
                raw = await model.complete(
                    system=DOCS_SYSTEM_PROMPT,
                    user=format_question_with_chunks(req.question, chunks),
                )
                validated = validate_response(raw, chunks)

        return AskResponse(
            answer=validated.answer,
            citations=validated.citations,
            refused=validated.refused,
            trace_id=trace_id,
            served_by=model.name,   # ← new field, used in online eval
        )

The served_by field is small but important — it lets us slice production metrics by model in the observability dashboard, so we can tell when the student starts drifting on a class of queries the eval didn’t catch.
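The slicing itself is trivial; a sketch assuming a hypothetical flat trace record (adapt the field names to your actual span attributes):

```python
from collections import defaultdict

def coverage_by_model(traces: list[dict]) -> dict:
    # Mean cite-coverage per serving model.
    buckets = defaultdict(list)
    for t in traces:
        buckets[t["served_by"]].append(t["cite_coverage"])
    return {model: sum(v) / len(v) for model, v in buckets.items()}

traces = [
    {"served_by": "student", "cite_coverage": 0.92},
    {"served_by": "student", "cite_coverage": 0.90},
    {"served_by": "teacher", "cite_coverage": 0.95},
]
means = coverage_by_model(traces)  # student ≈ 0.91, teacher = 0.95
```

Watch for the student mean sagging on a query class while the blended number stays flat; that is exactly the drift the offline eval will not catch.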

The eval results

Frozen eval set: 150 cases, three buckets (90 answerable in-scope / 30 boundary / 30 explicitly out-of-scope).

                          teacher   student   80/20-routed
                          ─────────────────────────────────
exact-match accuracy        76.7%    71.3%     75.3%
cite-coverage               94.1%    90.8%     93.7%
cite-correctness            96.3%    95.5%     96.1%
refusal precision           93.3%    90.0%     92.7%
refusal recall              90.0%    83.3%     88.7%

p50 latency (ms)             1240      390       720
p95 latency (ms)             3100      710      1860
$/1k calls                  $12.30    $1.45     $2.40

mean rubric score (1-5)      4.31     4.07      4.27

Three things to read out of this table:

  1. The student alone is not a drop-in. A 5pp drop on exact-match and a 7pp drop on refusal recall are both visible regressions. Refusal recall is the worst — the student over-answers boundary queries ~7pp more often than the teacher, which is exactly the failure mode we cannot tolerate.
  2. The router closes most of the gap. 80/20 routing brings exact-match within 1.4pp, cite-coverage within 0.4pp, refusal recall within 1.3pp, and rubric score within 0.04. All of those are below the noise floor of a 150-case eval (the normal-approximation 95% CI is roughly ±4–7pp per metric at this size). Statistically, 80/20-routed is indistinguishable from the teacher at this sample size.
  3. The cost-and-latency wins are not noise. The router hits a 5× cost reduction and a ~40% p50 latency reduction (~70% on student-served calls). Those numbers are stable across evals because they’re driven by which model gets called, not what the model says.

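The noise floor is quick to estimate with a normal approximation (n = 150, proportions read off the table; a back-of-envelope sketch, not a proper significance test):

```python
from math import sqrt

def ci95_half_width(p: float, n: int) -> float:
    # Normal-approximation 95% CI half-width for a proportion.
    return 1.96 * sqrt(p * (1 - p) / n)

print(round(ci95_half_width(0.767, 150), 3))  # exact-match:   ±0.068
print(round(ci95_half_width(0.941, 150), 3))  # cite-coverage: ±0.038
```

Gaps of a point or two simply cannot be resolved at n = 150; a paired test on per-case wins and losses buys more power if you need it.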
The 80/20-routed product is the one you ship.

The ROI calculation

Up-front costs (one-time):

synthetic data generation       ~$120     (12k examples × ~1k tokens × $10/Mtok)
LoRA training (A100 × 6h)        ~$10     (cloud rental)
eng time (build + tune router)  ~3 days   (this case study, end to end)

Recurring win, at 100k /ask/month:

  before:  100,000 × $12.30/k = $1,230/mo
  after:   100,000 ×  $2.40/k =   $240/mo
  saved:                          $990/mo
  break-even on infra-only:     ~4 days

At 1M /ask/month:

  before:  $12,300/mo
  after:    $2,400/mo
  saved:    $9,900/mo
  break-even on infra-only:     <1 day

The recurring win scales linearly with traffic, which is the whole point of distillation — the up-front cost is fixed, the saving compounds. Every month after the break-even point is essentially free margin.
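The break-even arithmetic, spelled out (infra-only figures from the tables above, 30-day months assumed):

```python
def break_even_days(upfront_usd: float, monthly_saving_usd: float) -> float:
    # Days until the one-time cost is recovered by the recurring saving.
    return upfront_usd / (monthly_saving_usd / 30)

print(round(break_even_days(130, 990), 1))    # 100k calls/mo → ~3.9 days
print(round(break_even_days(130, 9_900), 1))  # 1M calls/mo   → ~0.4 days
```

Engineering time dominates the real up-front cost; charge the ~3 eng-days against the saving at a typical day-rate and the break-even still lands well inside the first quarter at either scale.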

This is also why it’s worth doing after you have a working CS-01, not before. You need the eval, the retrieval stack, the validation, and the production traces to feed the synthetic-data pipeline. Distillation is a v2 move.

What we’d change in v2

  1. Retrain on production traces, not just synthetic data. Three months of served_by=teacher traces are a goldmine — they’re real queries the router already decided the student couldn’t handle. Train the next student on those and the router thresholds can ratchet tighter.
  2. Two students, one router. A 1B “fastest” student for clear in-scope queries, a 7B “midweight” student for ambiguous-but-handleable queries, and the teacher for the long tail. Three-tier routing typically saves another ~30%.
  3. Calibration on the student. The student’s confidence is poorly calibrated post-distillation (it inherits the teacher’s overconfidence and amplifies it). A temperature-scaling pass on the validation set, à la /demos/calibration, would tighten the refusal precision/recall numbers further.

Try this — predict the eval delta

Hands-on experiments to run after you’ve built it:

  1. Drop α to 0.0 (pure soft KL, no hard CE). Predict: the student fluently produces citation-shaped text but gets the actual [docs:env-vars/2] IDs wrong more often. Measure the cite-correctness drop. (Expected: ~10pp drop. The hard-CE on citation tokens is the single most important loss component for this product.)
  2. Drop the citation α-boost. Train at uniform α=0.3 across all tokens. Predict: prose stays great, citation accuracy drops ~5pp. Run the eval — the magnitude is what tells you whether per-token loss weighting is worth it on your product.
  3. Tighten the router to 90/10. Drop the retrieval-confidence threshold from 0.55 to 0.40. Predict: cost drops ~25%, refusal recall drops ~3pp (long-tail queries the student over-answers). Measure rubric score on the boundary bucket — that’s where you’ll see the regression first.
  4. Loosen the router to 60/40. Predict: cost climbs from $2.40/k → ~$5.50/k, eval gap shrinks to ≤1pp. The question this answers: how much are you willing to pay for the last percentage point of parity?
  5. Contaminate the training set on purpose. Skip the cosine-rejection step in build_training_set. Predict: training-set accuracy approaches 99% (looks great in the loss curve), eval set accuracy drops vs the clean run because the student overfit on near-duplicates. The lesson: train/eval contamination inflates training quality — exactly the wrong-direction tell.
  6. Try a different student base (Mistral-7B → Qwen2-7B → Llama-3-8B). Predict: the rank order on your specific task is unpredictable, but the gap between best and worst is usually 2–4pp. The exercise: build the parity table from /ship/17 for your corpus.

Each of these is a one-flag-flip experiment. Together they’re the calibration intuition that turns “I followed the recipe” into “I know which knob to turn next.”

Cross-references

Demos to run alongside:

  • Distillation Lab — interactive: real GPT-2 teacher logits, learnable student, KL gradient. Set α=0.3, T=2.0 to match this case study’s recipe; watch the student’s distribution converge over ~50 steps.
  • Calibration Lab — the v2 follow-up. Slide T from 1.0 → 2.0 on a held-out validation set to tune the student’s confidence post-distillation.
  • LLM-as-Judge — the rubric we use for the quality-filter step in build_training_set and for the production rubric score in the eval table.
  • Cost & Latency Calculator — predict the savings before you build. Compare “frontier all the way” vs “80/20 routed to a 7B student” on the input/output token sliders.

What this case study taught vs /ship

/ship/17 gives you the recipe. CS-05 gives you four product-specific moves the recipe can’t:

  • Cosine-rejection of training examples against the eval set. Generic recipes assume your eval is held out elsewhere; in practice the same hand-curation that builds the eval also builds the seed prompts, and contamination is the default failure mode.
  • Per-token loss weighting. When your output has high-stakes substructure (citation IDs, JSON keys, code identifiers), boosting α on those tokens is the single highest-ROI tweak. The recipe doesn’t tell you this because it’s product-specific by definition.
  • A three-rule router. The literature talks about learned routing classifiers; in production a hand-tuned escape-hatch list outperforms them on day one and gives you something you can debug. Add learning later if the rules stop holding.
  • Validator-on-fallback. When the student emits invalid JSON or a missing citation, automatically re-route the query to the teacher rather than returning an error. Costs ~3% extra teacher calls; saves ~100% of the user-visible failures.

Wrapping the case-studies arc

Five case studies, four products, one stack:

  • CS-01 — docs assistant. RAG + citation contract. The most-built LLM product.
  • CS-02 — code-review agent. Tool-use + action-rate metric. The most-shipped agent.
  • CS-03 — research assistant. Multi-agent orchestration + cost decomposition. The hardest to deploy.
  • CS-04 — customer-support bot. RAG + tools + escalation. The product that composes the entire stack.
  • CS-05 — the cheapest version of itself. Distillation + routing on top of CS-01. The product that pays for itself.

The arc: pick a product, ship it on a frontier model, measure it honestly, then squeeze the cost out of it without giving up the contract. CS-01 → CS-05 is the full lifecycle.

Every product ends here. You ship on the frontier. You measure. You distill. You route. Repeat.