Field report: Phi-3 — synthetic data and distillation, in the open
Field report. Observational study based on published sources. All claims cite the original paper or the official model card. Inference is marked explicitly. As of 2026-05-01.
The Phi-3 Technical Report (Microsoft, 2024) is the closest thing the field has to a public, end-to-end recipe for “small model, frontier-tier behavior.” It’s the real-world example the /articles/10-fine-tuning/distillation article is theory for.
This is a field report, not a tutorial. We map the Phi-3 paper to the curriculum’s vocabulary, mark the parts that aren’t public, and stop where the paper stops.
What was released
- Phi-3-mini (3.8B parameters), Phi-3-small (7B), Phi-3-medium (14B). MIT license. Open weights on HuggingFace.
- Phi-3 Technical Report (arXiv:2404.14219).
- Follow-up: Phi-4 (14B, December 2024) doubled down on the synthetic-data thesis.
The headline claim: Phi-3-mini at 3.8B parameters performs comparably to Mixtral 8×7B and GPT-3.5 on standard benchmarks, while being small enough to run on a phone. The thesis: data quality dominates parameter count, when the data is constructed deliberately.
What the paper actually says
Three things stand out in the paper that map directly onto the curriculum:
1. Two-stage data construction
Per Section 2 of the technical report, training data is built in two phases:
- Phase 1 — heavily filtered public web data, selected for “educational level.” The filter is itself a small classifier trained on what counts as high-quality.
- Phase 2 — synthetic data plus harder filtered web data, weighted toward reasoning, code, and math.
This is the same shape as the /ship/17 pipeline: seed → filter → synthesize → quality-gate. Microsoft did it at frontier scale with a teacher model generating “textbook-quality” examples.
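The four-step shape can be sketched in a few lines of Python. Everything below is a toy stand-in, not Microsoft's pipeline: the real Phase 1 filter is a trained classifier, the real synthesizer is a frontier teacher model, and the threshold is invented for illustration.

```python
# Toy sketch of the seed -> filter -> synthesize -> quality-gate shape.
# All functions and thresholds here are illustrative stand-ins.

def educational_score(text: str) -> float:
    """Stand-in for the "educational level" classifier: a crude
    heuristic that rewards explanation-like markers and some length."""
    markers = ("because", "therefore", "for example", "step")
    hits = sum(text.lower().count(m) for m in markers)
    return min(1.0, hits / 3 + min(len(text), 300) / 600)

def synthesize(seed_topic: str) -> str:
    """Stand-in for a teacher-model API call."""
    return f"Lesson on {seed_topic}: step 1 ..., because ..., for example ..."

def build_corpus(web_docs, seed_topics, threshold=0.5):
    # Phase 1: heavily filtered public web data.
    phase1 = [d for d in web_docs if educational_score(d) >= threshold]
    # Phase 2: synthetic data from curriculum seeds, same quality gate.
    drafts = (synthesize(t) for t in seed_topics)
    phase2 = [d for d in drafts if educational_score(d) >= threshold]
    return phase1, phase2
```

A real pipeline would also dedupe and run contamination checks between the two phases; the sketch only shows the filter-then-synthesize skeleton.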
2. The teacher is a frontier model
The synthetic examples are generated by larger models. The paper’s framing: a strong teacher writes pedagogical material; the student learns from material designed to be learnable. This is response distillation in our curriculum’s vocabulary — not logit distillation.
The paper does not name the teacher model precisely. Public discussions in subsequent papers and blog posts have pointed to GPT-4 and GPT-3.5 as teachers; we repeat that attribution only where Microsoft’s own writing confirms it.
3. Quality > quantity ablations
Phi-3-mini trained on 3.3T tokens — far more than the Chinchilla-optimal budget for a 3.8B model, which would be roughly 76B tokens (~20 tokens per parameter). The paper argues that curated data with more passes outperforms uncurated data at the “right” token count. The ablation tables back this up; specifically, Section 3 reports Phi-3-mini reaching MMLU and HumanEval scores in the same range as much larger contemporary models trained on much more data.
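The arithmetic behind that comparison is easy to check, using the common ~20-tokens-per-parameter reading of the Chinchilla result:

```python
# Back-of-envelope for the token counts above, using the common
# ~20-tokens-per-parameter reading of the Chinchilla result.
params = 3.8e9                    # Phi-3-mini parameter count
chinchilla_tokens = 20 * params   # ~76B "compute-optimal" tokens
trained_tokens = 3.3e12           # training tokens reported in the paper

overtraining = trained_tokens / chinchilla_tokens
print(f"Chinchilla-optimal: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Actually trained:   {trained_tokens / 1e12:.1f}T tokens "
      f"(~{overtraining:.0f}x the optimal budget)")
```

In other words, Phi-3-mini is deliberately overtrained by roughly 40x relative to the Chinchilla ratio; the bet is that curated tokens keep paying off long past the point where uncurated web tokens stop.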
The recipe in curriculum language
Mapped onto /articles/10-fine-tuning/distillation and /ship/17:
| Step in the curriculum | Phi-3 equivalent |
|---|---|
| Seed prompts | Curriculum-aligned topics (math, code, reasoning, common-sense) |
| Paraphrase via teacher | Synthetic example generation by a frontier model |
| Quality filter | Rubric-graded by the same teacher; keep high-scoring only |
| Dedupe + eval contamination check | Done; specifics not public |
| SFT student | Standard transformer training with the curated mix |
| Soft KL distillation | Not reported. Phi-3 is response distillation only, no logit transfer. |
| Production routing | Out of scope for the paper |
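The quality-filter row above (the teacher grades synthetic examples against a rubric; only high scorers are kept) can be sketched as follows. The rubric dimensions, the 1–5 scale, and the threshold are all assumptions here, since Microsoft has not published its rubric:

```python
# Sketch of a rubric-graded quality gate. Rubric dimensions, scale,
# and threshold are assumed; grade_with_teacher is a placeholder for
# an LLM call that would prompt the teacher and parse its scores.
RUBRIC = ("correctness", "clarity", "pedagogical_value")

def grade_with_teacher(example: str) -> dict:
    """Placeholder for a teacher-model call returning 1-5 scores."""
    base = 5 if "step" in example.lower() else 2   # toy heuristic
    return {dim: base for dim in RUBRIC}

def passes_gate(example: str, threshold: float = 4.0) -> bool:
    scores = grade_with_teacher(example)
    return sum(scores.values()) / len(scores) >= threshold

kept = [ex for ex in ("Solve it step by step: ...", "idk lol")
        if passes_gate(ex)]
```

The design point survives the toy heuristic: the gate is a function of the teacher's judgment, so the same model both writes and filters the data.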
The biggest delta from the curriculum: Phi-3 uses response distillation, not logit distillation. They’re training the student on text the teacher generated, not on the teacher’s softmax distribution. That’s cheaper at training time, can use closed-API teachers, and matches what most teams realistically do.
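The two losses differ in what the student gets to see from the teacher. A toy illustration on a four-token vocabulary, with made-up logits (stdlib only, no ML framework):

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(x / T) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative numbers on a 4-token vocabulary.
teacher_logits = [4.0, 1.0, 0.5, -2.0]
student_logits = [2.0, 1.5, 0.0, -1.0]

# Response distillation: the teacher emits a token (argmax here;
# sampled text in practice) and the student trains with ordinary
# cross-entropy on that hard label. Works through a text-only API.
teacher_token = max(range(4), key=lambda i: teacher_logits[i])
response_loss = -math.log(softmax(student_logits)[teacher_token])

# Logit distillation: KL divergence between temperature-softened
# teacher and student distributions. Requires the teacher's logits,
# which a closed API does not expose.
T = 2.0
p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
logit_loss = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
```

The soft loss carries the teacher's full distribution (how wrong each alternative is), which is exactly the signal response distillation gives up in exchange for working against a closed API.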
Reproducibility status
- Compute. Phi-3-mini training is reported at 3.3T tokens on a 3.8B model. Public estimates from independent groups put this at roughly 1500–2500 H100-days (the paper does not state the cluster size, so we describe the order of magnitude only).
- Data. Filtered web data and synthetic data, neither of which is released. A team attempting replication would need to construct their own.
- Tooling. Standard HuggingFace stack. Nothing exotic.
- Realistic for a frontier lab? Yes — done.
- Realistic for a well-funded startup? Partially. Generating hundreds of billions of synthetic tokens via teacher-model APIs costs millions of dollars; the compute bill is in the same range. A startup could plausibly replicate Phi-3-mini’s post-training stage on a Llama-3 base for far less.
- Realistic for an academic group? Not at the full scale. The Phi-3 recipe (small model, curated synthetic, careful filtering) is widely reproducible at smaller scales — see /case-studies/05 for a docs-assistant-shaped version.
- Realistic for a hobbyist? No, except as inspiration. The /ship/17 walkthrough scales the idea down to something feasible.
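A back-of-envelope sketch of the API-cost claim in the startup bullet. Both inputs are assumptions: the token count is one reading of "hundreds of billions", and the price is a rough 2024-era figure for frontier-model output tokens (published API pricing that year ranged very roughly from $5 to $30 per million output tokens):

```python
# Back-of-envelope for the "costs millions" claim. Both numbers are
# assumptions, not figures from the Phi-3 paper.
synthetic_tokens = 200e9          # assumed synthetic-corpus size
usd_per_1m_tokens = 10.0          # assumed API output-token price

cost_usd = synthetic_tokens / 1e6 * usd_per_1m_tokens
print(f"~${cost_usd / 1e6:.0f}M at ${usd_per_1m_tokens}/1M tokens")
```

Swap in your own token count and price; the point is only that the synthetic-generation bill lands in the millions, the same order as the training compute.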
What’s still confidential
The paper is unusually detailed for a frontier-lab release, but several things remain unstated:
- The exact prompting templates used to generate synthetic examples.
- The topic taxonomy that drove generation (how the curriculum was chunked into seed prompts).
- The quality-filter rubric — what specific scores were thresholds, what dimensions were graded.
- Full training-data composition by category (the paper gives broad categories, not a fine-grained mix).
- Hyperparameters for some stages.
- The teacher model version used (e.g. GPT-4 vs GPT-4-Turbo; Microsoft does not commit in print).
These gaps are why Phi-3 is a field report, not a recipe. The paper teaches the shape; the implementation details remain proprietary.
What’s changed since
- Phi-4 (December 2024) — 14B params, paper emphasizes ~70% synthetic data. Validates the Phi-3 thesis at larger scale.
- Open replications — several open-weight models (notably some Qwen and Llama-3 derivatives) have explicitly cited Phi-3’s data-construction approach.
- The “synthetic data dominates” thesis is now the consensus view for small-model post-training. Phi-3 was the inflection point.
What this teaches you
Read this back-to-back with /articles/10-fine-tuning/distillation and /ship/17:
- The synthetic-data + distillation pattern from /ship/17 is not a toy version of how real labs work — it’s the same pattern, scaled.
- The choice of response distillation vs logit distillation is the practical fork. Phi-3 picked response distillation because their teacher was an API-only frontier model. /ship/17 teaches both because either may be available to you.
- The “boring” steps — filtering, dedupe, contamination checks — are the steps Microsoft puts the most paper-pages on. That maps to the curriculum’s claim that those are the highest-ROI moves.
- Quality > quantity is now defensible from public data, not just folklore. Phi-3’s MMLU score per training token is a multiple of what same-era web-trained models achieved. The receipts are in the paper.
Further reading
Books move slower than papers in this field; treat these as the foundations under the Phi-3 paper, not replacements for it.
- “AI Engineering” by Chip Huyen (O’Reilly, 2024) — the most current production-AI book at this writing. Covers distillation, synthetic data, and evaluation as engineering disciplines. Read alongside /ship/17.
- “Hands-On Large Language Models” by Jay Alammar and Maarten Grootendorst (O’Reilly, 2024) — visual, practical, builds working systems chapter by chapter. The data-pipeline chapters are directly relevant to Phi-3’s two-phase construction.
- “Natural Language Processing with Transformers” by Lewis Tunstall, Leandro von Werra, and Thomas Wolf (O’Reilly, revised 2023) — the HuggingFace book. The SFT and distillation chapters use the same APIs Phi-3 was built on.
- “Build a Large Language Model From Scratch” by Sebastian Raschka (Manning, 2024) — for the foundations under all of this. Won’t cover Phi-3 directly, but you’ll understand what’s being trained when the paper says “decoder-only transformer.”
See also
- Distillation — the curriculum article this case study extends
- /ship/17 — synthetic data + distillation — the hands-on pipeline
- /case-studies/05 — the cheapest version of itself — a small-team version of the same playbook
- Stage 07 — Scaling laws — Chinchilla, and why Phi-3 deliberately violates it
- Stage 13 — Cost & latency — why making a small model competitive matters in production