Supervised Fine-Tuning (SFT)

The simplest form of fine-tuning. Take a pretrained model, train it on (input, output) pairs with the same next-token prediction objective used in pretraining. The model learns to produce your desired outputs.

The training objective

For each example (prompt, response):

Loss = − Σ_t log P(response_t | prompt, response_{<t})

You typically only compute loss on the response tokens, not the prompt — sometimes called “completion-only loss.” This focuses learning on the part you care about.
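
In HuggingFace terms, that masking means setting the label to -100 on every prompt position, since the cross-entropy loss inside causal LM models ignores that index. A minimal sketch with hypothetical token ids:

import torch

# Hypothetical token ids for one (prompt, response) pair.
prompt_ids   = [101, 2054, 2003]        # tokenized prompt
response_ids = [3000, 2003, 3000, 102]  # tokenized response

input_ids = torch.tensor(prompt_ids + response_ids)

# -100 is the ignore_index of the cross-entropy loss in
# HuggingFace causal LM models, so only response tokens contribute.
labels = torch.tensor([-100] * len(prompt_ids) + response_ids)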

Data format

The standard structure: a list of conversations.

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}

Different libraries expect slightly different formats; most align on this OpenAI-compatible shape.
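
Whatever the storage format, the messages have to be rendered into the model's own chat markup before tokenization. With HuggingFace tokenizers that's apply_chat_template, which emits the right special tokens for the model so you never hand-write them:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Renders the conversation with the model's own special tokens
# (e.g. <|start_header_id|>...<|end_header_id|> for Llama 3).
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)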

How much data

Folk numbers, holding up reasonably well:

  • 50–500 examples: enough for style/format adjustments via LoRA.
  • 1k–10k examples: enough for task specialization.
  • 10k–100k examples: enough for substantial behavior change, full SFT viable.
  • 100k+: more is better, with diminishing returns past ~1M for narrow tasks.

Quality matters more than quantity. A curated 1k beats a noisy 100k.

Data quality

What “high-quality SFT data” means:

  • Correctness: every output is what you’d want the model to produce.
  • Diversity: covers the input distribution you care about — not 1000 variations of the same query.
  • Calibrated difficulty: includes easy and hard cases.
  • Clean format: consistent, free of placeholder text, no broken markdown.
  • Refusals where appropriate: cases where the model should decline.
  • Edge cases: empty inputs, ambiguous inputs, adversarial inputs.

A typical fine-tune pipeline spends 80% of effort on data and 20% on training.

Training loop in TRL (HuggingFace)

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("your_data", split="train")  # messages-format dataset

config = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch of 16 per device
    learning_rate=2e-5,
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # TRL loads the model from this hub id
    train_dataset=dataset,
    args=config,
)
trainer.train()

That’s roughly the recipe for full SFT. Replace with LoRA configs if you don’t have the GPUs.
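
A sketch of that LoRA variant: pass a peft LoraConfig to the same trainer. The rank and target modules below are common starting points, not gospel.

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                  # adapter rank
    lora_alpha=32,         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=config,
    peft_config=peft_config,  # only the adapter weights get gradients
)
trainer.train()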

Hyperparameters that matter

  • Learning rate: 1e-5 to 5e-5 for full fine-tuning, often higher for LoRA (1e-4 to 5e-4). Smaller for larger models.
  • Epochs: 1–5. More than 5 usually overfits. For tiny datasets, 5–10 may be needed.
  • Batch size: as large as fits, typically 4–32 with gradient accumulation for effective batch of 64–256.
  • Warmup: 3–10% of total steps.
  • LR schedule: cosine decay, ending at ~10% of peak (see the config sketch after this list).
  • Weight decay: 0.01–0.1.
  • Gradient clipping: 1.0.
  • Max sequence length: as long as you need; longer = more memory.
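
Translated into an SFTConfig, those defaults look roughly like this. Field names drift between TRL/transformers versions; cosine_with_min_lr in particular needs a recent transformers, and plain "cosine" decays to zero instead.

config = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,            # effective batch of 64 per device
    learning_rate=2e-5,
    warmup_ratio=0.03,                         # ~3% of steps
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # end at 10% of peak LR
    weight_decay=0.01,
    max_grad_norm=1.0,                         # gradient clipping
    max_seq_length=2048,
    bf16=True,
)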

Choosing the base model

For SFT in 2026:

  • Open-weights, English: Llama 3.x family, Qwen3, Mistral, Phi-4.
  • Multilingual: Qwen3, Gemma 2/3, multilingual LLaMA variants.
  • Code: Qwen2.5-Coder, DeepSeek-Coder-V2, Codestral.
  • Already chat-tuned: start from *-Instruct versions for chat-style fine-tunes.
  • Reasoning seed: distilled R1 / o1 / Claude reasoning checkpoints if you want CoT-strong students.

Important: start from an instruction-tuned model unless you’re doing instruction-tuning yourself from scratch (rare). Pretraining-only checkpoints don’t follow chat formatting; you’d waste effort re-teaching it.

Catastrophic forgetting

Fine-tune naively and the model forgets general capabilities — it suddenly fails at things it used to do.

Mitigations:

  • Low learning rate: smaller updates = less forgetting.
  • LoRA: only train a small adapter; base weights unchanged.
  • Data mixing: include some general-purpose data (e.g. open-instruct) alongside your task data (see the sketch after this list).
  • Multi-task training: train on several tasks at once.
  • Regularization: KL penalty against the base model (essentially what RLHF does — see Stage 10’s RLHF article).
  • Replay buffer of original instruction-tuning data.
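
A minimal data-mixing sketch with HuggingFace datasets. The general-purpose set named here is an assumption (any instruction data you trust works), the 80/20 ratio is a knob rather than a rule, and both datasets must share the same schema for interleaving.

from datasets import load_dataset, interleave_datasets

task_data    = load_dataset("your_data", split="train")
general_data = load_dataset("allenai/tulu-3-sft-mixture", split="train")  # assumption: any general instruct set

# Sample ~80% task, ~20% general, so the model keeps seeing
# general-purpose instructions throughout the fine-tune.
mixed = interleave_datasets(
    [task_data, general_data],
    probabilities=[0.8, 0.2],
    seed=42,
)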

Evaluating SFT

Before and after the fine-tune, evaluate on:

  • Your target task: your golden eval set.
  • General capabilities: a small benchmark (e.g. MMLU subset, IFEval) to check for regression.
  • Safety: jailbreak resistance, refusal rates on harmful prompts.
  • Calibration: does the model correctly say “I don’t know” where it should?

Expect: target task improves, general capabilities flat or slightly down. If general capabilities crater, you’re overfitting.
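
A bare-bones version of that before/after check. The golden set and the substring-match scoring below are placeholders for your real eval harness.

from transformers import pipeline

# Placeholder golden set; replace with your own.
golden = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def eval_model(model_id: str) -> float:
    gen = pipeline("text-generation", model=model_id)
    hits = 0
    for ex in golden:
        out = gen(ex["prompt"], max_new_tokens=64)[0]["generated_text"]
        hits += ex["expected"].lower() in out.lower()
    return hits / len(golden)

base_score  = eval_model("meta-llama/Llama-3.1-8B-Instruct")
tuned_score = eval_model("./out")  # the SFT output dir from above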

Iteration loop

  1. Start with a small dataset (100 examples).
  2. Train, eval.
  3. Inspect failures. Add 50–200 examples targeting them.
  4. Train again. Eval.
  5. Repeat until target metric plateaus or you ship.

Don’t go straight to a 10k-example run. Iterate small first.

Continued pretraining vs SFT

Sometimes you want to inject domain knowledge rather than instruct behavior. Continued pretraining (also: “domain-adaptive pretraining”) trains on raw domain text without instruction format.

Pretrained model → continued pretraining on medical text → SFT on medical Q&A

Useful for very specialized vocabulary domains (legal, biomedical, code). Heavier than SFT; mix some general text to avoid forgetting.
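
A sketch of that continued-pretraining step with plain transformers: raw text, standard next-token objective, no chat template. The corpus name is a placeholder.

from transformers import (
    AutoModelForCausalLM, AutoTokenizer, Trainer,
    TrainingArguments, DataCollatorForLanguageModeling,
)
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B"  # base checkpoint, not -Instruct
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("your_medical_corpus", split="train")  # placeholder
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./cpt", num_train_epochs=1, bf16=True),
    train_dataset=tokenized,
    # mlm=False = plain causal LM loss, the same objective as pretraining
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()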

Multi-task SFT

If your model has multiple downstream uses, train on multiple tasks together:

[(question, answer), (summary input, summary output), (code input, code output), ...]

Often produces a more general assistant that’s competent on each task without separate models.
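
One way to realize that mixture is to map every task into the same messages schema before concatenating. The helper and column names below are illustrative; this assumes qa_data, summ_data, and code_data are already-loaded datasets.

from datasets import concatenate_datasets

def to_messages(user_text, assistant_text):
    # Illustrative helper: wrap any (input, output) pair in the chat schema.
    return {"messages": [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]}

# Drop the original columns so all three datasets end up with
# the identical schema that concatenate_datasets requires.
qa   = qa_data.map(lambda ex: to_messages(ex["question"], ex["answer"]),
                   remove_columns=qa_data.column_names)
summ = summ_data.map(lambda ex: to_messages(ex["article"], ex["summary"]),
                     remove_columns=summ_data.column_names)
code = code_data.map(lambda ex: to_messages(ex["spec"], ex["solution"]),
                     remove_columns=code_data.column_names)

multi_task = concatenate_datasets([qa, summ, code]).shuffle(seed=42)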

Pitfalls

  • Training only on positive examples — the model never learns to refuse, never learns from corrections.
  • Inconsistent formatting — the model picks up the inconsistency.
  • Leaked test data in training — looks great in eval, fails in production.
  • Wrong special tokens — chat templates differ between models. Get them right.
  • Loss not masked correctly — training on prompts wastes compute and can hurt quality.
  • Training too long — overfitting, loss of general capabilities.
