demo
A small student learning from a frontier teacher
Real GPT-2 logits as the teacher. A randomly initialized student. Slide the temperature, mix the hard and soft losses, hit play, and watch real gradient descent on the KL divergence pull the student's distribution toward the teacher's. The dark-knowledge transfer that makes a 1B model behave like a frontier one — made visible.
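The objective the demo animates can be sketched in a few lines. This is a minimal NumPy-only illustration, not the demo's code: it assumes a single token position, and the function names (`softmax`, `distillation_loss`) and default `T`/`alpha` values are illustrative choices.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax. Higher T flattens the distribution."""
    z = logits / T
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Mix hard cross-entropy with temperature-softened KL to the teacher.

    alpha weights the soft (teacher) term, (1 - alpha) the hard-label term.
    The T**2 factor keeps soft-loss gradient magnitudes comparable across
    temperatures, as in Hinton et al.'s original distillation formulation.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) over the softened distributions
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
    # Standard cross-entropy against the one-hot hard label (T = 1)
    hard = -np.log(softmax(student_logits)[hard_label])
    return alpha * (T**2) * kl + (1 - alpha) * hard

# A student far from the teacher yields a large loss; gradient descent on
# this quantity is what pulls the student's distribution toward the teacher's.
teacher = np.array([4.0, 1.0, 0.5])
student = np.array([0.1, 0.2, 0.3])
loss = distillation_loss(student, teacher, hard_label=0)
```

Sliding the temperature in the demo corresponds to changing `T` here: larger values expose more of the teacher's "dark knowledge" in the low-probability logits, while `alpha` moves the mix between imitating the teacher and fitting the hard labels.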
Anchored to 10-fine-tuning/distillation and the /ship/17 — synthetic data + distillation walkthrough.