Supervised Fine-Tuning (SFT)
The simplest form of fine-tuning. Take a pretrained model, train it on (input, output) pairs with the same next-token prediction objective. The model learns to produce your desired outputs.
The training objective
For each example (prompt, response):
Loss = − Σ_t log P(response_token_t | prompt + response[<t])
You typically only compute loss on the response tokens, not the prompt — sometimes called “completion-only loss.” This focuses learning on the part you care about.
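A minimal sketch of what that masking looks like in practice, assuming a HuggingFace-style tokenizer (build_example is an illustrative helper, not a library function): prompt tokens get the label -100, which cross-entropy ignores, so only response tokens drive the gradient.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def build_example(prompt: str, response: str) -> dict:
    # Tokenize prompt and response separately so the boundary between them is known.
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    # -100 is the ignore_index for cross-entropy: prompt tokens contribute no loss.
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}

Most training libraries (TRL included) can apply this masking for you; the point is just to see where the -100s go.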
Data format
The standard structure: a list of conversations.
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}
Different libraries expect slightly different formats; most align on this OpenAI-compatible shape.
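To turn that messages list into the exact text a specific model expects, HuggingFace tokenizers that ship a chat template can render it for you; a quick sketch (the model id is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Renders the conversation with this model's special tokens and role markers.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)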
How much data
Folk numbers, holding up reasonably well:
- 50–500 examples: enough for style/format adjustments via LoRA.
- 1k–10k examples: enough for task specialization.
- 10k–100k examples: enough for substantial behavior change, full SFT viable.
- 100k+: more is better, with diminishing returns past ~1M for narrow tasks.
Quality matters more than quantity. A curated 1k beats a noisy 100k.
Data quality
What “high-quality SFT data” means:
- Correctness: every output is what you’d want the model to produce.
- Diversity: covers the input distribution you care about — not 1000 variations of the same query.
- Calibrated difficulty: includes easy and hard cases.
- Clean format: consistent, free of placeholder text, no broken markdown.
- Refusals where appropriate: cases where the model should decline.
- Edge cases: empty inputs, ambiguous inputs, adversarial inputs.
A typical fine-tune pipeline spends 80% of effort on data and 20% on training.
Training loop in TRL (HuggingFace)
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("your_data", split="train")

config = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 4 * 4 = 16 per device
    learning_rate=2e-5,
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # a model id string; TRL loads it for you
    train_dataset=dataset,
    args=config,
)
trainer.train()
That’s roughly the recipe for full SFT. Replace with LoRA configs if you don’t have the GPUs.
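If full fine-tuning doesn't fit in memory, a LoRA run is a small change on top of the block above; a sketch assuming peft is installed and reusing the same dataset and config (rank and target modules are illustrative, not tuned values):

from peft import LoraConfig
from trl import SFTTrainer

peft_config = LoraConfig(
    r=16,                    # adapter rank
    lora_alpha=32,           # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=config,
    peft_config=peft_config,  # base weights stay frozen; only the adapter trains
)
trainer.train()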
Hyperparameters that matter
- Learning rate: 1e-5 to 5e-5 for full fine-tuning, often higher for LoRA (1e-4 to 5e-4). Smaller for larger models.
- Epochs: 1–5. More than 5 usually overfits. For tiny datasets, 5–10 may be needed.
- Batch size: as large as fits, typically 4–32 with gradient accumulation for effective batch of 64–256.
- Warmup: 3–10% of total steps.
- LR schedule: cosine decay, ending at ~10% of peak.
- Weight decay: 0.01–0.1.
- Gradient clipping: 1.0.
- Max sequence length: as long as you need; longer = more memory.
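Most of these map directly onto SFTConfig / TrainingArguments fields; a sketch with illustrative values (the max-sequence-length argument name has changed across TRL versions, so it's left out here):

from trl import SFTConfig

config = SFTConfig(
    output_dir="./out",
    learning_rate=2e-5,              # 1e-5 to 5e-5 for full fine-tuning; higher for LoRA
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # effective batch = 8 * 16 = 128 per device
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=1.0,               # gradient clipping
    bf16=True,
)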
Choosing the base model
For SFT in 2026:
- Open-weights, English: LLaMA-3.x family, Qwen3, Mistral, Phi-4.
- Multilingual: Qwen3, Gemma 2/3, multilingual LLaMA variants.
- Code: Qwen2.5-Coder, DeepSeek-Coder-V2, Codestral.
- Already chat-tuned: start from *-Instruct versions for chat-style fine-tunes.
- Reasoning seed: distilled R1 / o1 / Claude reasoning checkpoints if you want CoT-strong students.
Important: start from an instruction-tuned model unless you're doing instruction tuning yourself from scratch (rare). Pretraining-only checkpoints don't follow chat formatting, so you'd waste effort re-teaching that behavior.
Catastrophic forgetting
Fine-tune naively and the model forgets general capabilities — it suddenly fails at things it used to do.
Mitigations:
- Low learning rate: smaller updates = less forgetting.
- LoRA: only train a small adapter; base weights unchanged.
- Data mixing: include some general-purpose data (e.g. open-instruct) alongside your task data (see the sketch after this list).
- Multi-task training: train on several tasks at once.
- Regularization: KL penalty against the base model (essentially what RLHF does — see Stage 10’s RLHF article).
- Replay buffer of original instruction-tuning data.
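The data-mixing option is a one-liner with datasets.interleave_datasets; a sketch where "general_instructions" is a placeholder for whichever general-purpose set you pick:

from datasets import load_dataset, interleave_datasets

task_data = load_dataset("your_data", split="train")
general_data = load_dataset("general_instructions", split="train")  # placeholder name

# Sample roughly 20% general-purpose examples alongside the task data to limit forgetting.
mixed = interleave_datasets(
    [task_data, general_data],
    probabilities=[0.8, 0.2],
    seed=42,
)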
Evaluating SFT
Before and after the fine-tune, evaluate on:
- Your target task: your golden eval set.
- General capabilities: a small benchmark (e.g. MMLU subset, IFEval) to check for regression.
- Safety: jailbreak resistance, refusal rates on harmful prompts.
- Calibration: does the model correctly say “I don’t know” where it should?
Expect: target task improves, general capabilities flat or slightly down. If general capabilities crater, you’re overfitting.
Iteration loop
- Start with a small dataset (100 examples).
- Train, eval.
- Inspect failures. Add 50–200 examples targeting them.
- Train again. Eval.
- Repeat until target metric plateaus or you ship.
Don’t go straight to a 10k-example run. Iterate small first.
Continued pretraining vs SFT
Sometimes you want to inject domain knowledge rather than instruct behavior. Continued pretraining (also: “domain-adaptive pretraining”) trains on raw domain text without instruction format.
Pretrained model → continued pretraining on medical text → SFT on medical Q&A
Useful for domains with highly specialized vocabulary (legal, biomedical, code). Heavier than SFT; mix in some general text to avoid forgetting.
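With TRL this is mostly a config change: train on a raw-text column, typically with packing, and skip the chat formatting. A sketch, assuming a recent TRL where packing and dataset_text_field live on SFTConfig (the dataset name and column are placeholders):

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

corpus = load_dataset("your_medical_corpus", split="train")  # placeholder: raw text in a "text" column

config = SFTConfig(
    output_dir="./cpt",
    dataset_text_field="text",  # plain text, no chat template
    packing=True,               # concatenate documents into fixed-length blocks
    learning_rate=1e-5,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",  # base (non-Instruct) checkpoint for domain adaptation
    train_dataset=corpus,
    args=config,
)
trainer.train()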
Multi-task SFT
If your model has multiple downstream uses, train on multiple tasks together:
[(question, answer), (summary input, summary output), (code input, code output), ...]
Often produces a more general assistant that’s competent on each task without separate models.
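Mechanically this is just merging the datasets before training; a sketch with placeholder dataset names, assuming each is already in the messages format:

from datasets import load_dataset, concatenate_datasets

qa = load_dataset("your_qa_data", split="train")
summaries = load_dataset("your_summarization_data", split="train")
code = load_dataset("your_code_data", split="train")

# Concatenate and shuffle so tasks are mixed within every batch.
multi_task = concatenate_datasets([qa, summaries, code]).shuffle(seed=42)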
Pitfalls
- Training only on positive examples — the model never learns to refuse, never learns from corrections.
- Inconsistent formatting — the model picks up the inconsistency.
- Leaked test data in training — looks great in eval, fails in production.
- Wrong special tokens — chat templates differ between models. Get them right.
- Loss not masked correctly — training on prompts wastes compute and can hurt quality.
- Training too long — overfitting, loss of general capabilities.