Distillation
Take a frontier model and copy its behavior into a small model. The small model runs ~10× faster and costs ~10× less, with a quality gap typically in the single digits of percentage points. Distillation is how the cheap production tier gets made.
This is the most-used compression technique in 2026 production AI. Every time you use Haiku, GPT-4o-mini, Gemini Flash, or any “small fast” tier of a model family, you’re using a distilled descendant of the frontier one.
The core idea
Hinton, Vinyals, Dean (2015). A small “student” network learns from the outputs of a large “teacher” network, not just from the labels.
The teacher’s softmax distribution carries more information than a one-hot label. If a teacher classifies an image as 80% dog / 15% wolf / 5% cat, that’s three useful signals — the shape of the distribution tells the student that dog and wolf are similar, and that cat is far from both. Hinton called this dark knowledge.
hard label: [0, 0, 1, 0, ...] # one-hot on "dog"
teacher dist: [0.05, 0.15, 0.80, 0.00, 0.00, ...] # full probability vector
A student trained on the second signal converges faster, generalizes better, and handles ambiguous inputs more like the teacher.
The two-loss formulation
In practice you almost always combine both signals:
L_total = α · L_hard + (1 − α) · L_soft
L_hard = cross-entropy(student, one_hot_label) # standard SFT loss
L_soft = T² · KL(softmax(t/T) || softmax(s/T)) # KL on softened logits
Three things to notice:
- Temperature T softens both the teacher and student distributions before the KL is computed. T = 1 is the raw distribution; T > 1 flattens it. Common values: T ∈ [2, 5].
- The T² weight on L_soft is Hinton’s trick. Raising T flattens the softmax derivative; multiplying by T² puts the gradient back on the same scale as the hard loss, so you can mix them with a clean α without re-tuning the learning rate.
- α controls the mix. α = 1 is pure SFT (no dark knowledge). α = 0 is pure distillation. Most teams pick α ∈ [0.1, 0.5] — soft loss dominates, hard loss anchors the answer.
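Written out in PyTorch, the mix is only a few lines. A minimal sketch on full-vocabulary, per-position logits of shape (batch, vocab); the function name and shapes are illustrative, and the top-K production version appears later in this section:

    import torch
    import torch.nn.functional as F

    def distill_loss(s_logits, t_logits, labels, T=3.0, alpha=0.3):
        # Hard loss: standard cross-entropy against the one-hot label.
        L_hard = F.cross_entropy(s_logits, labels)
        # Soft loss: KL(teacher || student) on temperature-softened distributions,
        # scaled by T^2 to keep its gradient on the same scale as L_hard.
        s_logp = F.log_softmax(s_logits / T, dim=-1)
        t_p    = F.softmax(t_logits / T, dim=-1)
        L_soft = (T * T) * F.kl_div(s_logp, t_p, reduction="batchmean")
        return alpha * L_hard + (1 - alpha) * L_soft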
Why temperature matters
This is the single most important hyperparameter and the one most teams get wrong on first try.
At T = 1, the teacher’s softmax is sharp: typically one token gets 60–95% of the mass, and the runners-up are tiny. The student’s gradient barely sees the runners-up, so distillation degenerates into “match the top-1” — which you could’ve done with a one-hot label.
At T = 3, the teacher’s distribution flattens. The runners-up now hold meaningful mass, and the student gets gradient signal on the relative ordering of plausible alternatives. That ordering is the dark knowledge.
At T = 10, you’ve over-flattened. Differences between alternatives wash out into near-uniform; the gradient signal becomes noise.
There’s a sweet spot, usually T ∈ [2, 5], that you find empirically on a held-out set. The /demos/distillation lab lets you watch this directly on real GPT-2 logits — slide T from 1 → 5 and the student’s bars track the teacher’s softening in real time.
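You can also see the softening on toy logits. A minimal sketch (the logit values below are invented; only the qualitative shape of the effect matters):

    import torch
    import torch.nn.functional as F

    teacher_logits = torch.tensor([8.0, 4.0, 2.0, 0.0, -2.0])  # one clear winner

    for T in (1.0, 3.0, 10.0):
        p = F.softmax(teacher_logits / T, dim=-1)
        print(f"T={T:>4}: " + " ".join(f"{x:.3f}" for x in p.tolist()))

    # T= 1.0: the top token hoards the mass; runners-up are nearly invisible
    # T= 3.0: runners-up hold meaningful mass -- the ordering is now a training signal
    # T=10.0: close to uniform -- the ordering has washed out into noise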
What you actually distill
Three distinct things often get bundled under “distillation”:
1. Response distillation (the cheap one)
Generate (prompt, teacher_response) pairs. SFT the student on those pairs. Loss is just standard cross-entropy on the teacher’s chosen tokens.
This is what most “we distilled GPT-4 into a 7B” projects actually mean. Cheap, simple, no logprobs needed. Quality gap: ~5–15pp vs the teacher.
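In code, response distillation is just data labeling plus plain SFT. A minimal sketch, assuming a hypothetical teacher.complete() helper and the same batch layout as the training code later in this section:

    import torch.nn.functional as F

    def build_sft_pairs(prompts, teacher):
        # Label each prompt with the teacher's response; no logits are kept.
        return [{"prompt": p, "response": teacher.complete(p)} for p in prompts]

    def sft_step(student, batch):
        # Plain next-token cross-entropy on the teacher's chosen tokens --
        # no KL term, no temperature, no dark knowledge.
        logits = student(batch.input_ids)
        return F.cross_entropy(logits.flatten(0, 1), batch.target_ids.flatten())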
2. Logit distillation (the right one for quality)
Capture the teacher’s full or top-K logits at every output position. Train the student with the L_soft KL term on top of L_hard. This is what Hinton’s paper describes.
Requires the teacher to give you logits — which means open-weights teachers (or labs that expose logprobs in their API). Quality gap typically half that of response-only distillation.
3. Feature distillation (the fancy one)
Train the student to match the teacher’s intermediate representations, not just outputs. Loss is something like MSE(student_layer_k, teacher_layer_j). Used in vision a lot; less common in LLMs because architectures usually differ.
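For completeness, a minimal sketch of what the feature-matching loss looks like when the hidden sizes differ (the dimensions and layer choice here are illustrative, not a fixed recipe):

    import torch.nn as nn
    import torch.nn.functional as F

    student_hidden_dim, teacher_hidden_dim = 2048, 8192   # example sizes
    proj = nn.Linear(student_hidden_dim, teacher_hidden_dim)

    def feature_loss(student_hidden_k, teacher_hidden_j):
        # Both tensors: (batch, seq, hidden). Project the student up, then MSE.
        return F.mse_loss(proj(student_hidden_k), teacher_hidden_j)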
Most production LLM distillation is #1 or #2. #3 is a research move.
Top-K logit distillation
Capturing the full vocabulary distribution per token (50k+ floats) is expensive at scale. The standard production move is top-K logit distillation: capture only the top 16–64 logits per position.
    import torch

    def label_with_teacher(prompts, teacher, K=16):
        """Label prompts with the teacher's responses plus its top-K logits per position."""
        out = []
        for p in prompts:
            with torch.no_grad():
                # Assumes a teacher wrapper that returns per-position logits
                # alongside the sampled token ids.
                logits, choice = teacher.generate(p, return_logits=True)
            topk_vals, topk_ids = torch.topk(logits, k=K, dim=-1)
            out.append({
                "prompt_ids": p,
                "target_ids": choice,
                "teacher_topk_ids": topk_ids,    # which tokens
                "teacher_topk_vals": topk_vals,  # their logits
            })
        return out
At training time, the student computes its full logits but evaluates the KL term only over the teacher’s top-K entries:
    import torch
    import torch.nn.functional as F

    def distill_step(student, batch, T=2.0, alpha=0.3):
        s_logits = student(batch.input_ids)                  # (batch, seq, vocab)
        # Hard loss: cross-entropy against the teacher's chosen tokens.
        L_hard = F.cross_entropy(s_logits.flatten(0, 1), batch.target_ids.flatten())
        # Gather student logits at the teacher's top-K token positions.
        s_topk = torch.gather(s_logits, dim=-1, index=batch.teacher_topk_ids)
        s_logp_T = F.log_softmax(s_topk / T, dim=-1)
        t_p_T = F.softmax(batch.teacher_topk_vals / T, dim=-1)
        # Soft loss: KL(teacher || student) on the softened top-K, scaled by T².
        L_soft = (T * T) * F.kl_div(s_logp_T, t_p_T, reduction="batchmean")
        return alpha * L_hard + (1 - alpha) * L_soft
Storage cost: ~64 floats per output token instead of 50k. Quality cost: under 1pp on most tasks (the tail mass is genuinely small).
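The arithmetic behind that storage claim, assuming fp16 logit values and int32 token ids (a common but not universal layout):

    # Back-of-the-envelope bytes per output token.
    K, vocab = 64, 50_000
    topk_bytes = K * 2 + K * 4     # 64 fp16 values + 64 int32 ids = 384 bytes
    full_bytes = vocab * 2         # full fp16 distribution ≈ 100 KB
    print(topk_bytes, full_bytes, full_bytes // topk_bytes)   # 384, 100000, 260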
Synthetic data — where the prompts come from
The teacher’s distribution only matters at the prompts you actually train on. Picking those prompts is its own discipline:
1. Seed prompts (~100–500 hand-written) — the questions you actually want the student to handle well.
2. Paraphrase expansion — feed seeds to the teacher at temperature 0.9 with a “rephrase this” prompt; get 10× variants. Filter for diversity (cosine threshold).
3. Quality filter — judge each generated example with the teacher itself on a 1–5 rubric; drop scores under 4.
4. Dedupe — embed all examples; reject any pair within ~0.95 cosine.
5. Eval contamination check — reject any example within ~0.92 cosine of any holdout eval question. The single most-skipped step, and the one that makes parity numbers meaningless when skipped. Both this and the dedupe step are sketched below.
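Steps 4 and 5 are both nearest-neighbor checks over embeddings. A minimal sketch, assuming a hypothetical embed() that returns unit-normalized vectors (so a dot product is cosine similarity):

    import numpy as np

    def filter_examples(examples, eval_questions, dedupe_thr=0.95, contam_thr=0.92):
        eval_vecs = np.stack([embed(q) for q in eval_questions])
        kept, kept_vecs = [], []
        for ex in examples:
            v = embed(ex["prompt"])
            # Eval contamination check: too close to any holdout question -> reject.
            if (eval_vecs @ v).max() >= contam_thr:
                continue
            # Dedupe: too close to an example we already kept -> reject.
            if kept_vecs and (np.stack(kept_vecs) @ v).max() >= dedupe_thr:
                continue
            kept.append(ex)
            kept_vecs.append(v)
        return kept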
Steps 1–5 are the recipe in /ship/17 — synthetic data + distillation, with runnable code. The case study /case-studies/05 applies it end-to-end to a docs assistant.
What distillation transfers, and what it doesn’t
Transfers well:
- Output format (JSON shape, citation patterns, code style).
- Decision boundaries on the in-distribution task.
- Calibration (when paired with logit distillation at T > 1).
- Refusal behavior on prompts similar to training.
Transfers poorly:
- Long-tail knowledge the student doesn’t have capacity for.
- Reasoning chains longer than the student can sustain.
- Out-of-distribution generalization — the student inherits the teacher’s failures and adds its own.
- Multi-step planning when the teacher has a much larger context window.
The student inherits the teacher’s flaws. If the teacher hallucinates on questions about August 2024 events, the student will hallucinate the same way. Distillation is faithful imitation, not improvement.
Production routing — the cheat code
In practice, the student doesn’t have to handle everything. Pair it with a router that sends the hard cases to the teacher:
def route(query, retrieved_chunks):
if estimate_tokens(query, retrieved_chunks) > STUDENT_CONTEXT_LIMIT:
return teacher
if retrieval_confidence(retrieved_chunks) < 0.55:
return teacher
if requires_long_reasoning(query):
return teacher
return student
A 7B student handles ~80% of production traffic; the teacher handles the long tail. Net cost is dominated by the student. Net quality is dominated by the teacher on hard queries. Distillation + routing is the production sweet spot, not distillation alone.
The /case-studies/05 writeup walks through the routing rules for a real docs assistant. The headline number: 80/20 routing closes a 5pp eval gap to within ~1pp while keeping the cost win at ~5×.
When distillation is right, and when it isn’t
Worth doing:
- You’re paying real money on frontier API calls (>$1k/month).
- Your eval suite is mature enough to measure parity gaps in the single percentage points.
- Your task is narrow enough that a 7B student can plausibly cover ~80% of it.
- You have ~3 days of engineering time and ~$200 for compute.
Not worth doing:
- You’re spending under $500/month on inference. Cheaper levers (prompt caching, smaller frontier model, batching) come first.
- Your eval is “vibes.” You can’t measure parity, so you can’t measure regression.
- The teacher’s outputs aren’t your eventual ground truth — e.g. you’re still iterating on the prompt itself.
- Your task changes weekly. Distillation freezes the teacher’s behavior at one moment in time; you’d retrain constantly.
Distillation vs other compression
- Quantization (16→4 bits): ~3× smaller, ~2× faster, near-zero quality loss. Run before distillation if memory-bound. Often combined.
- Pruning: remove weights post-hoc. Complementary to distillation; modest win in practice.
- Speculative decoding: a draft model proposes tokens, the big model verifies. Same throughput at no quality cost — different lever entirely.
- MoE (mixture of experts): more parameters, fewer activated per token. Different design choice; not a compression technique.
The pecking order most teams discover: prompt cache → batching → quantization → distillation. Each cheaper than the next; each kicks in at a different scale.
Hands-on companions
Watch it interactively:
- Distillation Lab — real GPT-2 teacher logits, learnable student, KL gradient running live in your browser. Slide T and α; hit play; watch the student’s distribution converge over ~50 steps. Predict before clicking: at α = 0, T = 2.0, KL falls below 0.05 in roughly 30 steps; at α = 1, KL stays high because pure hard loss never sees the dark knowledge.
- Calibration Lab — temperature scaling on overconfident logits. The same T knob; the same softening; a different goal (calibration, not transfer). Useful for the post-distillation calibration pass.
- LLM-as-Judge — the rubric tool used in the synthetic-data quality-filter step.
Build it in code:
- /ship/17 — synthetic data + distillation — full hands-on recipe: seed prompts → paraphrase → filter → train → measure parity → route. ~220 lines of stack/distill.py.
- /case-studies/05 — the cheapest version of itself — applies the recipe to the docs assistant from CS-01 with real cost numbers (~5× cheaper, ~1pp eval gap after routing).
Real-world case studies
- Field report: Phi-3 — Microsoft’s published recipe for synthetic-data + distillation at frontier scale. Maps the curriculum’s vocabulary onto a real release.
- Field report: Llama 3 — iterative DPO + rejection sampling at 405B scale; what implicit distillation looks like inside a frontier post-training loop.
- Field report: DeepSeek-R1 — published response-distillation recipe for reasoning behavior into six smaller students.
See also
- LoRA & QLoRA — the cheap fine-tuning that makes the student trainable on a single GPU
- Supervised fine-tuning (SFT) — the loss being mixed in via α · L_hard
- Data & tooling — TRL, Axolotl, Unsloth (all support distillation natively)
- Stage 01 — Information theory — KL divergence and entropy
- Stage 13 — Cost & latency — distillation in the broader cost picture