Supervised Fine-Tuning (SFT)

The simplest form of fine-tuning. Take a pretrained model, train it on (input, output) pairs with the same next-token prediction objective used in pretraining. The model learns to produce your desired outputs.

The training objective

For each example (prompt, response):

Loss = − Σ_t log P(response_t | prompt, response_{<t})

You typically only compute loss on the response tokens, not the prompt — sometimes called “completion-only loss.” This focuses learning on the part you care about.
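
In HuggingFace terms, that masking means setting the label to -100 on every prompt position, since the cross-entropy loss inside causal LM models ignores that index. A minimal sketch with hypothetical token ids:

import torch

# Hypothetical token ids for one (prompt, response) pair.
prompt_ids   = [101, 2054, 2003]        # tokenized prompt
response_ids = [3000, 2003, 3000, 102]  # tokenized response

input_ids = torch.tensor(prompt_ids + response_ids)

# -100 is the ignore_index of the cross-entropy loss in
# HuggingFace causal LM models, so only response tokens contribute.
labels = torch.tensor([-100] * len(prompt_ids) + response_ids)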

Data format

The standard structure: a list of conversations.

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
  ]
}

Different libraries expect slightly different formats; most align on this OpenAI-compatible shape.
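
Whatever the storage format, the messages have to be rendered into the model's own chat markup before tokenization. With HuggingFace tokenizers that's apply_chat_template, which emits the right special tokens for the model so you never hand-write them:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Renders the conversation with the model's own special tokens
# (e.g. <|start_header_id|>...<|end_header_id|> for Llama 3).
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)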

How much data

Folk numbers, holding up reasonably well:

  • 50–500 examples: enough for style/format adjustments via LoRA.
  • 1k–10k examples: enough for task specialization.
  • 10k–100k examples: enough for substantial behavior change, full SFT viable.
  • 100k+: more is better, with diminishing returns past ~1M for narrow tasks.

Quality matters more than quantity. A curated 1k beats a noisy 100k.

Data quality

What “high-quality SFT data” means:

  • Correctness: every output is what you’d want the model to produce.
  • Diversity: covers the input distribution you care about — not 1000 variations of the same query.
  • Calibrated difficulty: includes easy and hard cases.
  • Clean format: consistent, free of placeholder text, no broken markdown.
  • Refusals where appropriate: cases where the model should decline.
  • Edge cases: empty inputs, ambiguous inputs, adversarial inputs.

A typical fine-tune pipeline spends 80% of effort on data and 20% on training.

Training loop in TRL (HuggingFace)

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("your_data", split="train")  # messages-format dataset

config = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch of 16 per device
    learning_rate=2e-5,
    warmup_ratio=0.03,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # TRL loads the model from this hub id
    train_dataset=dataset,
    args=config,
)
trainer.train()

That’s roughly the recipe for full SFT. Replace with LoRA configs if you don’t have the GPUs.
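
A sketch of that LoRA variant: pass a peft LoraConfig to the same trainer. The rank and target modules below are common starting points, not gospel.

from peft import LoraConfig

peft_config = LoraConfig(
    r=16,                  # adapter rank
    lora_alpha=32,         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    args=config,
    peft_config=peft_config,  # only the adapter weights get gradients
)
trainer.train()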

Hyperparameters that matter

  • Learning rate: 1e-5 to 5e-5 for full fine-tuning, often higher for LoRA (1e-4 to 5e-4). Smaller for larger models.
  • Epochs: 1–5. More than 5 usually overfits. For tiny datasets, 5–10 may be needed.
  • Batch size: as large as fits, typically 4–32 with gradient accumulation for effective batch of 64–256.
  • Warmup: 3–10% of total steps.
  • LR schedule: cosine decay, ending at ~10% of peak (see the config sketch after this list).
  • Weight decay: 0.01–0.1.
  • Gradient clipping: 1.0.
  • Max sequence length: as long as you need; longer = more memory.
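
Translated into an SFTConfig, those defaults look roughly like this. Field names drift between TRL/transformers versions; cosine_with_min_lr in particular needs a recent transformers, and plain "cosine" decays to zero instead.

config = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,            # effective batch of 64 per device
    learning_rate=2e-5,
    warmup_ratio=0.03,                         # ~3% of steps
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},  # end at 10% of peak LR
    weight_decay=0.01,
    max_grad_norm=1.0,                         # gradient clipping
    max_seq_length=2048,
    bf16=True,
)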

Choosing the base model

For SFT in 2026:

  • Open-weights, English: Llama 3.x family, Qwen3, Mistral, Phi-4.
  • Multilingual: Qwen3, Gemma 2/3, multilingual LLaMA variants.
  • Code: Qwen2.5-Coder, DeepSeek-Coder-V2, Codestral.
  • Already chat-tuned: start from *-Instruct versions for chat-style fine-tunes.
  • Reasoning seed: distilled R1 / o1 / Claude reasoning checkpoints if you want CoT-strong students.

Important: start from an instruction-tuned model unless you’re doing instruction-tuning yourself from scratch (rare). Pretraining-only checkpoints don’t follow chat formatting; you’d waste effort re-teaching it.

Catastrophic forgetting

Fine-tune naively and the model forgets general capabilities — it suddenly fails at things it used to do.

Mitigations:

  • Low learning rate: smaller updates = less forgetting.
  • LoRA: only train a small adapter; base weights unchanged.
  • Data mixing: include some general-purpose data (e.g. open-instruct) alongside your task data (see the sketch after this list).
  • Multi-task training: train on several tasks at once.
  • Regularization: KL penalty against the base model (essentially what RLHF does — see Stage 10’s RLHF article).
  • Replay buffer of original instruction-tuning data.
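
A minimal data-mixing sketch with HuggingFace datasets. The general-purpose set named here is an assumption (any instruction data you trust works), the 80/20 ratio is a knob rather than a rule, and both datasets must share the same schema for interleaving.

from datasets import load_dataset, interleave_datasets

task_data    = load_dataset("your_data", split="train")
general_data = load_dataset("allenai/tulu-3-sft-mixture", split="train")  # assumption: any general instruct set

# Sample ~80% task, ~20% general, so the model keeps seeing
# general-purpose instructions throughout the fine-tune.
mixed = interleave_datasets(
    [task_data, general_data],
    probabilities=[0.8, 0.2],
    seed=42,
)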

Evaluating SFT

Before and after the fine-tune, evaluate on:

  • Your target task: your golden eval set.
  • General capabilities: a small benchmark (e.g. MMLU subset, IFEval) to check for regression.
  • Safety: jailbreak resistance, refusal rates on harmful prompts.
  • Calibration: does the model correctly say “I don’t know” where it should?

Expect: target task improves, general capabilities flat or slightly down. If general capabilities crater, you’re overfitting.
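
A bare-bones version of that before/after check. The golden set and the substring-match scoring below are placeholders for your real eval harness.

from transformers import pipeline

# Placeholder golden set; replace with your own.
golden = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def eval_model(model_id: str) -> float:
    gen = pipeline("text-generation", model=model_id)
    hits = 0
    for ex in golden:
        out = gen(ex["prompt"], max_new_tokens=64)[0]["generated_text"]
        hits += ex["expected"].lower() in out.lower()
    return hits / len(golden)

base_score  = eval_model("meta-llama/Llama-3.1-8B-Instruct")
tuned_score = eval_model("./out")  # the SFT output dir from above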

Iteration loop

  1. Start with a small dataset (100 examples).
  2. Train, eval.
  3. Inspect failures. Add 50–200 examples targeting them.
  4. Train again. Eval.
  5. Repeat until target metric plateaus or you ship.

Don’t go straight to a 10k-example run. Iterate small first.

Continued pretraining vs SFT

Sometimes you want to inject domain knowledge rather than instruct behavior. Continued pretraining (also: “domain-adaptive pretraining”) trains on raw domain text without instruction format.

Pretrained model → continued pretraining on medical text → SFT on medical Q&A

Useful for very specialized vocabulary domains (legal, biomedical, code). Heavier than SFT; mix some general text to avoid forgetting.
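
A sketch of that continued-pretraining step with plain transformers: raw text, standard next-token objective, no chat template. The corpus name is a placeholder.

from transformers import (
    AutoModelForCausalLM, AutoTokenizer, Trainer,
    TrainingArguments, DataCollatorForLanguageModeling,
)
from datasets import load_dataset

model_id = "meta-llama/Llama-3.1-8B"  # base checkpoint, not -Instruct
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

raw = load_dataset("your_medical_corpus", split="train")  # placeholder
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./cpt", num_train_epochs=1, bf16=True),
    train_dataset=tokenized,
    # mlm=False = plain causal LM loss, the same objective as pretraining
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()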

Multi-task SFT

If your model has multiple downstream uses, train on multiple tasks together:

[(question, answer), (summary input, summary output), (code input, code output), ...]

Often produces a more general assistant that’s competent on each task without separate models.
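
One way to realize that mixture is to map every task into the same messages schema before concatenating. The helper and column names below are illustrative; this assumes qa_data, summ_data, and code_data are already-loaded datasets.

from datasets import concatenate_datasets

def to_messages(user_text, assistant_text):
    # Illustrative helper: wrap any (input, output) pair in the chat schema.
    return {"messages": [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]}

# Drop the original columns so all three datasets end up with
# the identical schema that concatenate_datasets requires.
qa   = qa_data.map(lambda ex: to_messages(ex["question"], ex["answer"]),
                   remove_columns=qa_data.column_names)
summ = summ_data.map(lambda ex: to_messages(ex["article"], ex["summary"]),
                     remove_columns=summ_data.column_names)
code = code_data.map(lambda ex: to_messages(ex["spec"], ex["solution"]),
                     remove_columns=code_data.column_names)

multi_task = concatenate_datasets([qa, summ, code]).shuffle(seed=42)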

Pitfalls

  • Training only on positive examples — the model never learns to refuse, never learns from corrections.
  • Inconsistent formatting — the model picks up the inconsistency.
  • Leaked test data in training — looks great in eval, fails in production.
  • Wrong special tokens — chat templates differ between models. Get them right.
  • Loss not masked correctly — training on prompts wastes compute and can hurt quality.
  • Training too long — overfitting, loss of general capabilities.
