Field report: Phi-3 — synthetic data and distillation, in the open
Field report. Observational study based on published sources. All claims cite the original paper or the official model card. Inference is marked explicitly. As of 2026-05-01.
The Phi-3 Technical Report (Microsoft, 2024) is the closest thing the field has to a public, end-to-end recipe for “small model, frontier-tier behavior.” It’s the real-world example the /articles/10-fine-tuning/distillation article is theory for.
This is a field report, not a tutorial. We map the Phi-3 paper to the curriculum’s vocabulary, mark the parts that aren’t public, and stop where the paper stops.
What was released
- Phi-3-mini (3.8B parameters), Phi-3-small (7B), Phi-3-medium (14B). MIT license. Open weights on HuggingFace.
- Phi-3 Technical Report (arXiv:2404.14219).
- Follow-up: Phi-4 (14B, December 2024) doubled down on the synthetic-data thesis.
The headline claim: Phi-3-mini at 3.8B parameters performs comparably to Mixtral 8×7B and GPT-3.5 on standard benchmarks, while being small enough to run on a phone. The thesis: data quality dominates parameter count, when the data is constructed deliberately.
What the paper actually says
Three things stand out in the paper that map directly onto the curriculum:
1. Two-stage data construction
Per Section 2 of the technical report, training data is built in two phases:
- Phase 1 — heavily filtered public web data, selected for “educational level.” The filter is itself a small classifier trained on what counts as high-quality.
- Phase 2 — synthetic data plus harder filtered web data, weighted toward reasoning, code, and math.
This is the same shape as the /ship/17 pipeline: seed → filter → synthesize → quality-gate. Microsoft did it at frontier scale with a teacher model generating “textbook-quality” examples.
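The four-step shape can be sketched in a few lines of Python. Everything below is a toy stand-in, not Microsoft's pipeline: the real Phase 1 filter is a trained classifier, the real synthesizer is a frontier teacher model, and the threshold is invented for illustration.

```python
# Toy sketch of the seed -> filter -> synthesize -> quality-gate shape.
# All functions and thresholds here are illustrative stand-ins.

def educational_score(text: str) -> float:
    """Stand-in for the "educational level" classifier: a crude
    heuristic that rewards explanation-like markers and some length."""
    markers = ("because", "therefore", "for example", "step")
    hits = sum(text.lower().count(m) for m in markers)
    return min(1.0, hits / 3 + min(len(text), 300) / 600)

def synthesize(seed_topic: str) -> str:
    """Stand-in for a teacher-model API call."""
    return f"Lesson on {seed_topic}: step 1 ..., because ..., for example ..."

def build_corpus(web_docs, seed_topics, threshold=0.5):
    # Phase 1: heavily filtered public web data.
    phase1 = [d for d in web_docs if educational_score(d) >= threshold]
    # Phase 2: synthetic data from curriculum seeds, same quality gate.
    drafts = (synthesize(t) for t in seed_topics)
    phase2 = [d for d in drafts if educational_score(d) >= threshold]
    return phase1, phase2
```

A real pipeline would also dedupe and run contamination checks between the two phases; the sketch only shows the filter-then-synthesize skeleton.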
2. The teacher is a frontier model
The synthetic examples are generated by larger models. The paper’s framing: a strong teacher writes pedagogical material; the student learns from material designed to be learnable. This is response distillation in our curriculum’s vocabulary — not logit distillation.
The paper does not name the teacher model precisely. Public discussions in subsequent papers and blog posts have pointed to GPT-4 and GPT-3.5 as teachers; we repeat that attribution only where Microsoft’s own writing confirms it.
3. Quality > quantity ablations
Phi-3-mini trained on 3.3T tokens — far more than the Chinchilla-optimal budget for a 3.8B model, which would be roughly 76B tokens (~20 tokens per parameter). The paper argues that curated data with more passes outperforms uncurated data at the “right” token count. The ablation tables back this up; specifically, Section 3 reports Phi-3-mini reaching MMLU and HumanEval scores in the same range as much larger contemporary models trained on much more data.
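The arithmetic behind that comparison is easy to check, using the common ~20-tokens-per-parameter reading of the Chinchilla result:

```python
# Back-of-envelope for the token counts above, using the common
# ~20-tokens-per-parameter reading of the Chinchilla result.
params = 3.8e9                    # Phi-3-mini parameter count
chinchilla_tokens = 20 * params   # ~76B "compute-optimal" tokens
trained_tokens = 3.3e12           # training tokens reported in the paper

overtraining = trained_tokens / chinchilla_tokens
print(f"Chinchilla-optimal: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Actually trained:   {trained_tokens / 1e12:.1f}T tokens "
      f"(~{overtraining:.0f}x the optimal budget)")
```

In other words, Phi-3-mini is deliberately overtrained by roughly 40x relative to the Chinchilla ratio; the bet is that curated tokens keep paying off long past the point where uncurated web tokens stop.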
The recipe in curriculum language
Mapped onto /articles/10-fine-tuning/distillation and /ship/17:
| Step in the curriculum | Phi-3 equivalent |
|---|---|
| Seed prompts | Curriculum-aligned topics (math, code, reasoning, common-sense) |
| Paraphrase via teacher | Synthetic example generation by a frontier model |
| Quality filter | Rubric-graded by the same teacher; keep high-scoring only |
| Dedupe + eval contamination check | Done; specifics not public |
| SFT student | Standard transformer training with the curated mix |
| Soft KL distillation | Not reported. Phi-3 is response distillation only, no logit transfer. |
| Production routing | Out of scope for the paper |
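The quality-filter row above (the teacher grades synthetic examples against a rubric; only high scorers are kept) can be sketched as follows. The rubric dimensions, the 1–5 scale, and the threshold are all assumptions here, since Microsoft has not published its rubric:

```python
# Sketch of a rubric-graded quality gate. Rubric dimensions, scale,
# and threshold are assumed; grade_with_teacher is a placeholder for
# an LLM call that would prompt the teacher and parse its scores.
RUBRIC = ("correctness", "clarity", "pedagogical_value")

def grade_with_teacher(example: str) -> dict:
    """Placeholder for a teacher-model call returning 1-5 scores."""
    base = 5 if "step" in example.lower() else 2   # toy heuristic
    return {dim: base for dim in RUBRIC}

def passes_gate(example: str, threshold: float = 4.0) -> bool:
    scores = grade_with_teacher(example)
    return sum(scores.values()) / len(scores) >= threshold

kept = [ex for ex in ("Solve it step by step: ...", "idk lol")
        if passes_gate(ex)]
```

The design point survives the toy heuristic: the gate is a function of the teacher's judgment, so the same model both writes and filters the data.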
The biggest delta from the curriculum: Phi-3 uses response distillation, not logit distillation. They’re training the student on text the teacher generated, not on the teacher’s softmax distribution. That’s cheaper at training time, can use closed-API teachers, and matches what most teams realistically do.
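The two losses differ in what the student gets to see from the teacher. A toy illustration on a four-token vocabulary, with made-up logits (stdlib only, no ML framework):

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(x / T) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Illustrative numbers on a 4-token vocabulary.
teacher_logits = [4.0, 1.0, 0.5, -2.0]
student_logits = [2.0, 1.5, 0.0, -1.0]

# Response distillation: the teacher emits a token (argmax here;
# sampled text in practice) and the student trains with ordinary
# cross-entropy on that hard label. Works through a text-only API.
teacher_token = max(range(4), key=lambda i: teacher_logits[i])
response_loss = -math.log(softmax(student_logits)[teacher_token])

# Logit distillation: KL divergence between temperature-softened
# teacher and student distributions. Requires the teacher's logits,
# which a closed API does not expose.
T = 2.0
p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
logit_loss = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
```

The soft loss carries the teacher's full distribution (how wrong each alternative is), which is exactly the signal response distillation gives up in exchange for working against a closed API.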
Reproducibility status
- Compute. Phi-3-mini training is reported at 3.3T tokens on a 3.8B model. Public estimates from independent groups put this at roughly 1500–2500 H100-days (the paper does not state the cluster size, so we describe the order of magnitude only).
- Data. Filtered web data and synthetic data, neither of which is released. A team attempting replication would need to construct their own.
- Tooling. Standard HuggingFace stack. Nothing exotic.
- Realistic for a frontier lab? Yes — done.
- Realistic for a well-funded startup? Partially. Generating hundreds of billions of synthetic tokens via teacher-model APIs costs millions of dollars; the compute bill is in the same range. A startup could plausibly replicate Phi-3-mini’s post-training stage on a Llama-3 base for far less.
- Realistic for an academic group? Not at the full scale. The Phi-3 recipe (small model, curated synthetic, careful filtering) is widely reproducible at smaller scales — see /case-studies/05 for a docs-assistant-shaped version.
- Realistic for a hobbyist? No, except as inspiration. The /ship/17 walkthrough scales the idea down to something feasible.
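A back-of-envelope sketch of the API-cost claim in the startup bullet. Both inputs are assumptions: the token count is one reading of "hundreds of billions", and the price is a rough 2024-era figure for frontier-model output tokens (published API pricing that year ranged very roughly from $5 to $30 per million output tokens):

```python
# Back-of-envelope for the "costs millions" claim. Both numbers are
# assumptions, not figures from the Phi-3 paper.
synthetic_tokens = 200e9          # assumed synthetic-corpus size
usd_per_1m_tokens = 10.0          # assumed API output-token price

cost_usd = synthetic_tokens / 1e6 * usd_per_1m_tokens
print(f"~${cost_usd / 1e6:.0f}M at ${usd_per_1m_tokens}/1M tokens")
```

Swap in your own token count and price; the point is only that the synthetic-generation bill lands in the millions, the same order as the training compute.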
What’s still confidential
The paper is unusually detailed for a frontier-lab release, but several things remain unstated:
- The exact prompting templates used to generate synthetic examples.
- The topic taxonomy that drove generation (how the curriculum was chunked into seed prompts).
- The quality-filter rubric — what specific scores were thresholds, what dimensions were graded.
- Full training-data composition by category (the paper gives broad categories, not a fine-grained mix).
- Hyperparameters for some stages.
- The teacher model version used (e.g. GPT-4 vs GPT-4-Turbo; Microsoft does not commit in print).
These gaps are why Phi-3 is a field report, not a recipe. The paper teaches the shape; the implementation details remain proprietary.
What’s changed since
- Phi-4 (December 2024) — 14B params, paper emphasizes ~70% synthetic data. Validates the Phi-3 thesis at larger scale.
- Open replications — several open-weight models (notably some Qwen and Llama-3 derivatives) have explicitly cited Phi-3’s data-construction approach.
- The “synthetic data dominates” thesis is now the consensus view for small-model post-training. Phi-3 was the inflection point.
What this teaches you
Read this back-to-back with /articles/10-fine-tuning/distillation and /ship/17:
- The synthetic-data + distillation pattern from /ship/17 is not a toy version of how real labs work — it’s the same pattern, scaled.
- The choice of response distillation vs logit distillation is the practical fork. Phi-3 picked response distillation because their teacher was an API-only frontier model. /ship/17 teaches both because either may be available to you.
- The “boring” steps — filtering, dedupe, contamination checks — are the steps Microsoft puts the most paper-pages on. That maps to the curriculum’s claim that those are the highest-ROI moves.
- Quality > quantity is now defensible from public data, not just folklore. Phi-3’s MMLU score per training token is a multiple of what same-era web-trained models achieved. The receipts are in the paper.
Further reading
Books move slower than papers in this field; treat these as the foundations under the Phi-3 paper, not replacements for it.
- “AI Engineering” by Chip Huyen (O’Reilly, 2024) — the most current production-AI book at this writing. Covers distillation, synthetic data, and evaluation as engineering disciplines. Read alongside /ship/17.
- “Hands-On Large Language Models” by Jay Alammar and Maarten Grootendorst (O’Reilly, 2024) — visual, practical, builds working systems chapter by chapter. The data-pipeline chapters are directly relevant to Phi-3’s two-phase construction.
- “Natural Language Processing with Transformers” by Lewis Tunstall, Leandro von Werra, and Thomas Wolf (O’Reilly, revised 2023) — the HuggingFace book. The SFT and distillation chapters use the same APIs Phi-3 was built on.
- “Build a Large Language Model From Scratch” by Sebastian Raschka (Manning, 2024) — for the foundations under all of this. Won’t cover Phi-3 directly, but you’ll understand what’s being trained when the paper says “decoder-only transformer.”
See also
- Distillation — the curriculum article this case study extends
- /ship/17 — synthetic data + distillation — the hands-on pipeline
- /case-studies/05 — the cheapest version of itself — a small-team version of the same playbook
- Stage 07 — Scaling laws — Chinchilla, and why Phi-3 deliberately violates it
- Stage 13 — Cost & latency — why making a small model competitive matters in production