Stage 10 — Fine-Tuning
When prompting and RAG aren’t enough, you change the model itself. Fine-tuning takes a pretrained model and adapts it to your task, your domain, your style — by continuing training on data you curate.
In 2026, fine-tuning is less of a default than it was in 2022. Frontier-model prompting plus RAG covers many cases that used to require fine-tuning. But for the cases that remain, fine-tuning is irreplaceable.
Prerequisites
- Stage 03 (NN training, optimizers)
- Stage 06 (transformer architecture)
- Stage 08 (prompting)
Learning ladder
- When to fine-tune — decision flow vs prompting and RAG
- Supervised fine-tuning (SFT) — the foundation
- LoRA & QLoRA — parameter-efficient training
- RLHF, DPO, GRPO — preference and reward-based training
- Distillation — copy a teacher’s behavior into a small student
- Embedding fine-tuning
- Data & tooling — TRL, Axolotl, Unsloth, dataset design
MVU
You can:
- Decide when fine-tuning is the right tool (and when it isn’t)
- Pick between SFT, LoRA, DPO, GRPO based on use case and resources
- Estimate the data and compute needed for a given fine-tune (a worked back-of-envelope estimate follows this list)
- Avoid the most common pitfall: fine-tuning on data your model already does fine on
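A worked version of that estimation skill, under loudly stated assumptions: 4-bit base weights, rank-16 adapters on four projections in each of 32 layers, roughly 512 tokens per example, and activations/KV cache ignored. Ballpark only.

```python
# Back-of-envelope estimate for a QLoRA fine-tune of a 7B model (illustrative assumptions).
params = 7e9
base_4bit_gb = params * 0.5 / 1e9                  # 4-bit weights ≈ 0.5 bytes per parameter
r, layers, d, projections = 16, 32, 4096, 4
lora_params = 2 * r * d * projections * layers     # A and B per targeted projection ≈ 16.8M
# bf16 adapter weights + fp32 grads + Adam m and v ≈ 14 bytes per trainable parameter
optimizer_gb = lora_params * 14 / 1e9
tokens = 1_000 * 512                               # 1k examples × ~512 tokens each
print(f"base ≈ {base_4bit_gb:.1f} GB, adapter ≈ {lora_params/1e6:.1f}M params "
      f"(+{optimizer_gb:.2f} GB optimizer state), ~{tokens/1e6:.1f}M training tokens")
```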
Exercise
LoRA-fine-tune a 7B model on a 1k-example instruction dataset of your design. Evaluate against the base model on a held-out test set. Aim for measurable improvement on your target task without regression on general capabilities.
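One plausible way to set the exercise up, sketched with Hugging Face `transformers` and `peft`. Exact arguments drift between library versions, and the base model, target modules, and hyperparameters below are assumptions rather than requirements.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"   # any ~7B base model; this specific choice is an assumption
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()       # typically well under 1% of total params

# From here: tokenize the 1k-example instruction dataset, train with transformers.Trainer
# or trl's SFTTrainer, then compare adapter vs. base model on the held-out test set.
```

Keep the held-out set fully disjoint from the training examples so the before/after comparison, and the check for regressions on general capabilities, stays honest.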
Field reports — real-world case studies
Observational write-ups based on published papers, with strict citation and explicit “what’s still confidential” sections. Each maps a curriculum article to a real frontier-lab artifact.
- Field report: Phi-3 — synthetic data + distillation, in the open. Anchors `distillation.md` and `/ship/17` to a real Microsoft release.
- Field report: Llama 3 — Meta’s 92-page post-training paper, distilled to what curriculum readers should pay attention to. Anchors `rlhf-dpo-grpo.md` and the iterative-DPO pattern (a minimal DPO-loss sketch follows this list).
- Field report: DeepSeek-R1 — pure-RL reasoning training and the published distillation recipe. Anchors `/articles/07-modern-llms/reasoning-models.md`.
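The iterative-DPO pattern the Llama 3 report references reduces, at each step, to a single preference loss. A minimal sketch, assuming summed per-sequence log-probabilities of the chosen and rejected responses have already been computed under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss on per-sequence log-probs (sketch); beta controls deviation from the reference."""
    chosen_margin = beta * (policy_chosen - ref_chosen)        # implicit reward of the preferred answer
    rejected_margin = beta * (policy_rejected - ref_rejected)  # implicit reward of the rejected answer
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# toy numbers standing in for real sequence log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(float(loss))   # smaller when the policy prefers the chosen response more than the reference does
```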
Hands-on companions
Watch it interactively:
- LoRA Lab — drag the rank slider; watch the singular-value spectrum reveal why low-rank matters. Real Jacobi SVD on a synthetic ΔW.
- RLHF Lab — your A/B picks fit a Bradley-Terry reward model in real time. Real gradient descent runs on the labels you click; weights update; loss curve plotted. (A toy Bradley-Terry fit is sketched after this list.)
- Distillation Lab — real GPT-2 teacher logits, learnable student, KL gradient running live. Slide `T` and `α`; watch the student’s distribution converge. (The temperature-scaled KL loss is sketched below.)
- Quantization Lab — slide bits from 16 → 2; watch RMSE rise and memory drop. Foundational for QLoRA’s 4-bit base. (A uniform-quantization RMSE demo is sketched below.)
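What the RLHF Lab fits, reduced to a toy. A linear Bradley-Terry reward model trained by gradient descent on simulated pairwise preferences; the features, labels, and dimensions here are synthetic, and the lab’s internals may differ.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feats = torch.randn(100, 5)                         # feature vectors for 100 candidate responses
true_w = torch.tensor([1.0, -0.5, 2.0, 0.0, 0.3])   # hidden "true" reward direction
i = torch.randint(0, 100, (500,))                    # 500 simulated A/B comparisons
j = torch.randint(0, 100, (500,))
# Bradley-Terry: P(i preferred over j) = sigmoid(r_i - r_j)
pref = torch.bernoulli(torch.sigmoid(feats[i] @ true_w - feats[j] @ true_w))

w = torch.zeros(5, requires_grad=True)               # learnable reward weights
opt = torch.optim.Adam([w], lr=0.05)
for _ in range(300):
    margin = feats[i] @ w - feats[j] @ w             # predicted reward gap per comparison
    loss = F.binary_cross_entropy_with_logits(margin, pref)
    opt.zero_grad(); loss.backward(); opt.step()
print("learned w:", w.detach())                      # should point roughly along true_w
```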
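The Distillation Lab’s objective under the standard temperature-scaled formulation; how the lab weights the soft term against a hard-label term may differ from the α convention assumed here.

```python
import torch
import torch.nn.functional as F

def soft_distill_loss(student_logits, teacher_logits, T=2.0, alpha=0.9):
    """Soft-label term: KL(teacher || student) at temperature T, scaled by T^2 (sketch)."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps the soft-label gradients on the same scale as a (1 - alpha) hard-label CE term
    return alpha * F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

teacher = torch.randn(4, 50257)                        # stand-in for frozen GPT-2 teacher logits
student = torch.randn(4, 50257, requires_grad=True)    # learnable student logits
loss = soft_distill_loss(student, teacher)
loss.backward()                                        # gradient flows only into the student
print(float(loss))
```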
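And the Quantization Lab’s core measurement, sketched with plain uniform symmetric quantization. QLoRA’s 4-bit base actually uses the NF4 format rather than a uniform grid, so treat the numbers as directional.

```python
import numpy as np

# Quantize-then-dequantize a synthetic weight tensor at decreasing bit-widths and measure RMSE.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=100_000).astype(np.float32)
for bits in (16, 8, 4, 2):
    levels = 2 ** (bits - 1) - 1                 # symmetric signed grid
    scale = np.abs(w).max() / levels
    w_q = np.round(w / scale) * scale            # uniform quantization, then dequantization
    rmse = float(np.sqrt(np.mean((w - w_q) ** 2)))
    print(f"{bits:2d}-bit  RMSE={rmse:.6f}  weight memory = {bits / 16:.0%} of fp16")
```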
Build it in code:
- `/build/12` — LoRA-fine-tune your tiny GPT — from frozen weights to a working LoRA training loop in ~50 lines.
- `/ship/16` — what’s next — honest take on when LoRA pays off in production vs when prompting wins.
- `/ship/17` — synthetic data + distillation — full hands-on distillation pipeline; ~220 lines of `stack/distill.py`.
- `/case-studies/05` — the cheapest version of itself — distillation applied end-to-end to the docs assistant.
See also
- Stage 09 — RAG — the alternative for many use cases
- Stage 13 — Production — serving your fine-tuned model




