Stage 10 — Fine-Tuning
When prompting and RAG aren’t enough, you change the model itself. Fine-tuning takes a pretrained model and adapts it to your task, your domain, your style — by continuing training on data you curate.
In 2026, fine-tuning is less of a default than it was in 2022. Frontier-model prompting plus RAG covers many cases that used to require fine-tuning. But for the cases that remain, fine-tuning is irreplaceable.
Prerequisites
- Stage 03 (NN training, optimizers)
- Stage 06 (transformer architecture)
- Stage 08 (prompting)
Learning ladder
- When to fine-tune — decision flow vs prompting and RAG
- Supervised fine-tuning (SFT) — the foundation
- LoRA & QLoRA — parameter-efficient training
- RLHF, DPO, GRPO — preference and reward-based training
- Distillation — copy a teacher’s behavior into a small student
- Embedding fine-tuning
- Data & tooling — TRL, Axolotl, Unsloth, dataset design
MVU
You can:
- Decide when fine-tuning is the right tool (and when it isn’t)
- Pick between SFT, LoRA, DPO, GRPO based on use case and resources
- Estimate the data and compute needed for a given fine-tune (a worked back-of-envelope estimate follows this list)
- Avoid the most common pitfall: fine-tuning on data your model already does fine on
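A worked version of that estimation skill, under loudly stated assumptions: 4-bit base weights, rank-16 adapters on four projections in each of 32 layers, roughly 512 tokens per example, and activations/KV cache ignored. Ballpark only.

```python
# Back-of-envelope estimate for a QLoRA fine-tune of a 7B model (illustrative assumptions).
params = 7e9
base_4bit_gb = params * 0.5 / 1e9                  # 4-bit weights ≈ 0.5 bytes per parameter
r, layers, d, projections = 16, 32, 4096, 4
lora_params = 2 * r * d * projections * layers     # A and B per targeted projection ≈ 16.8M
# bf16 adapter weights + fp32 grads + Adam m and v ≈ 14 bytes per trainable parameter
optimizer_gb = lora_params * 14 / 1e9
tokens = 1_000 * 512                               # 1k examples × ~512 tokens each
print(f"base ≈ {base_4bit_gb:.1f} GB, adapter ≈ {lora_params/1e6:.1f}M params "
      f"(+{optimizer_gb:.2f} GB optimizer state), ~{tokens/1e6:.1f}M training tokens")
```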
Exercise
LoRA-fine-tune a 7B model on a 1k-example instruction dataset of your design. Evaluate against the base model on a held-out test set. Aim for measurable improvement on your target task without regression on general capabilities.
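One plausible way to set the exercise up, sketched with Hugging Face `transformers` and `peft`. Exact arguments drift between library versions, and the base model, target modules, and hyperparameters below are assumptions rather than requirements.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"   # any ~7B base model; this specific choice is an assumption
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # typically well under 1% of total params

# From here: tokenize the 1k-example instruction dataset, train with transformers.Trainer
# or trl's SFTTrainer, then compare adapter vs. base model on the held-out test set.
```

Keep the held-out set fully disjoint from the training examples so the before/after comparison, and the check for regressions on general capabilities, stays honest.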
Field reports — real-world case studies
Observational write-ups based on published papers, with strict citation and explicit “what’s still confidential” sections. Each maps a curriculum article to a real frontier-lab artifact.
- Field report: Phi-3 — synthetic data + distillation, in the open. Anchors `distillation.md` and `/ship/17` to a real Microsoft release.
- Field report: Llama 3 — Meta’s 92-page post-training paper, distilled to what curriculum readers should pay attention to. Anchors `rlhf-dpo-grpo.md` and the iterative-DPO pattern (a minimal DPO-loss sketch follows this list).
- Field report: DeepSeek-R1 — pure-RL reasoning training and the published distillation recipe. Anchors `/articles/07-modern-llms/reasoning-models.md`.
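The iterative-DPO pattern the Llama 3 report references reduces, at each step, to a single preference loss. A minimal sketch, assuming summed per-sequence log-probabilities of the chosen and rejected responses have already been computed under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on per-sequence log-probs (sketch); beta controls deviation from the reference."""
    chosen_margin = beta * (policy_chosen - ref_chosen)        # implicit reward of the preferred answer
    rejected_margin = beta * (policy_rejected - ref_rejected)  # implicit reward of the rejected answer
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# toy numbers standing in for real sequence log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(float(loss))   # smaller when the policy prefers the chosen response more than the reference does
```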
Hands-on companions
Watch it interactively:
- LoRA Lab — drag the rank slider; watch the singular-value spectrum reveal why low-rank matters. Real Jacobi SVD on a synthetic ΔW.
- RLHF Lab — your A/B picks fit a Bradley-Terry reward model in real time. Real gradient descent runs on the labels you click; weights update; loss curve plotted. (A toy Bradley-Terry fit is sketched after this list.)
- Distillation Lab — real GPT-2 teacher logits, learnable student, KL gradient running live. Slide `T` and `α`; watch the student’s distribution converge. (The temperature-scaled KL loss is sketched below.)
- Quantization Lab — slide bits from 16 → 2; watch RMSE rise and memory drop. Foundational for QLoRA’s 4-bit base. (A uniform-quantization RMSE demo is sketched below.)
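What the RLHF Lab fits, reduced to a toy. A linear Bradley-Terry reward model trained by gradient descent on simulated pairwise preferences; the features, labels, and dimensions here are synthetic, and the lab’s internals may differ.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feats = torch.randn(100, 5)                         # feature vectors for 100 candidate responses
true_w = torch.tensor([1.0, -0.5, 2.0, 0.0, 0.3])   # hidden "true" reward direction
i = torch.randint(0, 100, (500,))                    # 500 simulated A/B comparisons
j = torch.randint(0, 100, (500,))
# Bradley-Terry: P(i preferred over j) = sigmoid(r_i - r_j)
pref = torch.bernoulli(torch.sigmoid(feats[i] @ true_w - feats[j] @ true_w))

w = torch.zeros(5, requires_grad=True)               # learnable reward weights
opt = torch.optim.Adam([w], lr=0.05)
for _ in range(300):
    margin = feats[i] @ w - feats[j] @ w             # predicted reward gap per comparison
    loss = F.binary_cross_entropy_with_logits(margin, pref)
    opt.zero_grad(); loss.backward(); opt.step()
print("learned w:", w.detach())                      # should point roughly along true_w
```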
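The Distillation Lab’s objective under the standard temperature-scaled formulation; how the lab weights the soft term against a hard-label term may differ from the α convention assumed here.

```python
import torch
import torch.nn.functional as F

def soft_distill_loss(student_logits, teacher_logits, T=2.0, alpha=0.9):
    """Soft-label term: KL(teacher || student) at temperature T, scaled by T^2 (sketch)."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps the soft-label gradients on the same scale as a (1 - alpha) hard-label CE term
    return alpha * F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

teacher = torch.randn(4, 50257)                        # stand-in for frozen GPT-2 teacher logits
student = torch.randn(4, 50257, requires_grad=True)    # learnable student logits
loss = soft_distill_loss(student, teacher)
loss.backward()                                        # gradient flows only into the student
print(float(loss))
```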
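And the Quantization Lab’s core measurement, sketched with plain uniform symmetric quantization. QLoRA’s 4-bit base actually uses the NF4 format rather than a uniform grid, so treat the numbers as directional.

```python
import numpy as np

# Quantize-then-dequantize a synthetic weight tensor at decreasing bit-widths and measure RMSE.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=100_000).astype(np.float32)
for bits in (16, 8, 4, 2):
    levels = 2 ** (bits - 1) - 1                 # symmetric signed grid
    scale = np.abs(w).max() / levels
    w_q = np.round(w / scale) * scale            # uniform quantization, then dequantization
    rmse = float(np.sqrt(np.mean((w - w_q) ** 2)))
    print(f"{bits:2d}-bit  RMSE={rmse:.6f}  weight memory = {bits / 16:.0%} of fp16")
```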
Build it in code:
- `/build/12` — LoRA-fine-tune your tiny GPT — from frozen weights to a working LoRA training loop in ~50 lines.
- `/ship/16` — what’s next — honest take on when LoRA pays off in production vs when prompting wins.
- `/ship/17` — synthetic data + distillation — full hands-on distillation pipeline; ~220 lines of `stack/distill.py`.
- `/case-studies/05` — the cheapest version of itself — distillation applied end-to-end to the docs assistant.
See also
- Stage 09 — RAG — the alternative for many use cases
- Stage 13 — Production — serving your fine-tuned model




