When to Fine-Tune

The most common fine-tuning mistake is fine-tuning when you shouldn’t. The second-most-common is not fine-tuning when you should.

The decision flow

Before you fine-tune, try in this order:

  1. Better prompting (Stage 08). Few-shot examples, structured output, system prompts.
  2. RAG (Stage 09). If you need fresh / proprietary information.
  3. Better model. Sometimes upgrading from a 7B to a frontier model is cheaper than fine-tuning.
  4. Tool use / agents (Stage 11). For workflows that need actions.
  5. Then consider fine-tuning.

What fine-tuning is good at

  • Style and voice: making the model sound a certain way (a brand persona, a code style, a specific tone).
  • Format adherence: producing strict outputs with high reliability.
  • Domain idioms: medical/legal/finance jargon and conventions.
  • Cost reduction: a fine-tuned 7B can match a frontier model on a narrow task at 1/100th the cost.
  • Latency reduction: a fine-tuned smaller model is faster than prompting a larger one.
  • Refusal calibration: getting the model to answer or decline appropriately.
  • Tool use patterns: training models to call your specific tools well.

What fine-tuning is BAD at

  • Adding factual knowledge: fine-tuning tends to distort what the model already knows more than it reliably adds new facts. Use RAG for facts.
  • Improving reasoning: small fine-tunes don’t teach reasoning; they make existing reasoning fit a format. Use a reasoning model.
  • Turning mediocre data into magical results: garbage in, garbage out, even at scale.
  • One-off behaviors: prompting is way cheaper for “make this output JSON.”
  • Frequently changing requirements: fine-tuning is slow; iteration is expensive.

Rule of thumb: if a frontier model with the right prompt and the right context can do it, don’t fine-tune. Fine-tuning is for the persistent gap that prompts can’t close.

Cost & data tradeoffs

Rough numbers (early 2026):

| Approach | Compute | Data | Time | Cost |
|---|---|---|---|---|
| Prompting only | $0 | a few examples | minutes | $0 |
| RAG | $100s | corpus | days | low, ongoing |
| LoRA fine-tuning | 1 GPU × hours | 100s–10k examples | hours | $10s–$100s |
| Full SFT | 1–8 GPUs × days | 10k–1M examples | days | $100s–$1000s |
| RLHF / DPO | 1–8 GPUs × days | preference pairs + reward model | weeks | $1000s |
| Reasoning RL | many GPUs × weeks | task suite + verifier | weeks | $10,000s+ |
| Pretraining | thousands of GPUs × months | trillions of tokens | months | millions+ |

For most teams, the practical fine-tuning options are LoRA SFT (cheap and small) and DPO (preference alignment). Anything bigger requires real ML infrastructure.
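To give a sense of scale for the LoRA row, here is a minimal sketch of attaching LoRA adapters with Hugging Face `peft`. The base model name and the hyperparameters (`r`, `lora_alpha`, the target modules) are placeholder assumptions, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base model; substitute whatever open model you actually use.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Common starting-point LoRA settings (assumptions; tune for your task).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

Training then proceeds as ordinary SFT on your (input, output) pairs; only the adapter weights update, which is why this fits in the one-GPU, hours-not-days row of the table.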

Scenarios mapped to techniques

| You want to… | Try first |
|---|---|
| Get a specific JSON shape | Prompt + structured output API |
| Use up-to-date / private docs | RAG |
| Reduce token cost on a high-volume task | LoRA-fine-tune a smaller model |
| Make outputs sound like your brand | LoRA SFT on style examples |
| Improve a chatbot’s helpfulness | Frontier model + system prompt |
| Specialize for a narrow domain (medical, legal) | LoRA SFT on domain data + RAG |
| Train a model to use your tools well | SFT on traces of correct tool use |
| Align with human preferences | DPO on pairwise preferences |
| Improve at math/code reasoning | Reasoning RL (hard, expensive) |
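For the preference-alignment row, the data format is the real work: a DPO example is a prompt plus a preferred and a dispreferred completion. A minimal sketch of one record, using the common prompt/chosen/rejected field names (the text itself is invented):

```python
# One preference record in the widely used prompt/chosen/rejected layout.
# The content is illustrative only.
preference_example = {
    "prompt": "Summarize this incident report for the on-call engineer: ...",
    "chosen": "Two-sentence summary that leads with the action required.",
    "rejected": "Five-paragraph recap that buries the action item at the end.",
}
```

Collecting thousands of consistently labeled pairs is usually the expensive part; the DPO training run itself is comparatively cheap.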

Cost-benefit math

Quick check:

Per-call savings × monthly call volume × months in production
   vs.
Fine-tune cost + infra + maintenance

If you do 100M calls/month at $0.001 saving each, that’s $100k/month. A $5k fine-tune pays for itself instantly.

If you do 100k calls/month at $0.0005 saving each, that’s $50/month. A fine-tune doesn’t pay back in any reasonable timeframe.
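The same check as a few lines of Python, plugging in the two illustrative scenarios above and an assumed $5k fine-tune cost:

```python
def payback_months(per_call_saving, calls_per_month, finetune_cost):
    """Months until a fine-tune pays for itself, ignoring infra and maintenance."""
    return finetune_cost / (per_call_saving * calls_per_month)

# 100M calls/month, $0.001 saved per call: pays back in ~0.05 months.
print(payback_months(0.001, 100_000_000, 5_000))

# 100k calls/month, $0.0005 saved per call: ~100 months, i.e. never in practice.
print(payback_months(0.0005, 100_000, 5_000))
```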

When the data isn’t there

Most teams don’t have 10k high-quality examples for fine-tuning. Options:

  1. Synthetic data: have a strong model generate (input, output) pairs. Filter/verify carefully.
  2. Distillation: capture traces of a frontier model on your task and train a smaller model to match (a minimal sketch follows this list).
  3. Few-shot first: 100 examples as in-context prompts often beat 100 examples as fine-tune data.
  4. Active learning: deploy a baseline; collect data from real use; fine-tune on the hardest cases.
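A minimal sketch of the distillation option: run a strong model over your real inputs and save the traces as chat-format SFT records. The teacher model name and file paths are placeholders, and you would still filter and verify the outputs before training on them.

```python
import json
from openai import OpenAI

client = OpenAI()

def capture_trace(user_input: str) -> dict:
    """Get a frontier-model completion and package it as one chat-format SFT record."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": user_input}],
    )
    answer = response.choices[0].message.content
    return {"messages": [
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": answer},
    ]}

# Replace with inputs collected from your actual task.
real_task_inputs = ["Example task input 1", "Example task input 2"]

with open("distilled_sft.jsonl", "w") as f:
    for user_input in real_task_inputs:
        f.write(json.dumps(capture_trace(user_input)) + "\n")
```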

Closed vs open

  • Closed (OpenAI, Anthropic, Google fine-tuning APIs): easy, expensive, locked into vendor.
  • Open (LLaMA, Qwen, Mistral, etc.): requires GPUs but you own the model.

Closed APIs are great for SFT-style use cases when you don’t want to manage infra. Open models are the path for serious customization, lower per-call cost, or data privacy.
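To make the "easy" side concrete, the closed path is typically a file upload plus a job-creation call. A rough sketch with the OpenAI SDK (the base model name is a placeholder; check the provider's docs for currently supported models and data formats):

```python
from openai import OpenAI

client = OpenAI()

# Upload chat-format JSONL training data.
training_file = client.files.create(
    file=open("train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job; the provider handles GPUs, checkpoints, and hosting.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder; use a currently supported base model
)
print(job.id)
```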

Common anti-patterns

  • “Fine-tuning will fix it” without a clear hypothesis about what’s wrong with the base model.
  • Fine-tuning on a task the frontier model already does fine — wasted effort.
  • Fine-tuning on garbage data — you get a model fluent in garbage.
  • No held-out eval — you can’t tell if you improved or just memorized (see the sketch after this list).
  • One epoch when you need three (or three when you need one) — tuning the training length matters.
  • Forgetting general capabilities — the model used to chat, now it only outputs JSON. Mix in some general data.
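A minimal guard against the held-out-eval and forgetting anti-patterns, assuming your data lives in JSONL and you use Hugging Face `datasets` (file names are placeholders):

```python
from datasets import load_dataset, concatenate_datasets

# Carve out a held-out eval split BEFORE training, so you can measure
# improvement rather than memorization.
task = load_dataset("json", data_files="my_task_data.jsonl", split="train")
splits = task.train_test_split(test_size=0.1, seed=42)
train, held_out = splits["train"], splits["test"]

# Mix in some general instruction data so the model keeps its general abilities.
general = load_dataset("json", data_files="general_chat_sample.jsonl", split="train")
mixed_train = concatenate_datasets([train, general]).shuffle(seed=42)
```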

A useful mental model

Pretraining gives the model knowledge. Instruction tuning gives it manners. RLHF/DPO gives it alignment. Your fine-tune gives it a specialty.

If your “specialty” overlaps with what the manners + knowledge already cover, prompting is enough.

See also