When to Fine-Tune
The most common fine-tuning mistake is fine-tuning when you shouldn’t. The second-most-common is not fine-tuning when you should.
The decision flow
Before you fine-tune, try in this order:
- Better prompting (Stage 08). Few-shot examples, structured output, system prompts.
- RAG (Stage 09). If you need fresh / proprietary information.
- Better model. Sometimes upgrading from a 7B to a frontier model is cheaper than fine-tuning.
- Tool use / agents (Stage 11). For workflows that need actions.
- Then consider fine-tuning.
What fine-tuning is good at
- Style and voice: making the model sound a certain way (a brand persona, a code style, a specific tone).
- Format adherence: producing strict outputs with high reliability.
- Domain idioms: medical/legal/finance jargon and conventions.
- Cost reduction: a fine-tuned 7B can match a frontier model on a narrow task at 1/100th the cost.
- Latency reduction: a fine-tuned smaller model is faster than prompting a larger one.
- Refusal calibration: getting the model to answer or decline appropriately.
- Tool use patterns: training models to call your specific tools well.
What fine-tuning is BAD at
- Adding factual knowledge: fine-tuning is more likely to make the model hallucinate confidently than to reliably learn new facts. Use RAG for facts.
- Improving reasoning: small fine-tunes don’t teach reasoning; they make existing reasoning fit a format. Use a reasoning model.
- Turning mediocre data into magical results: garbage in, garbage out, even at scale.
- One-off behaviors: prompting is way cheaper for “make this output JSON.”
- Frequently changing requirements: fine-tuning is slow; iteration is expensive.
Rule of thumb: if a frontier model with the right prompt and the right context can do it, don’t fine-tune. Fine-tuning is for the persistent gap that prompts can’t close.
Cost & data tradeoffs
Rough numbers (early 2026):
| Approach | Compute | Data | Time | Cost |
|---|---|---|---|---|
| Prompting only | $0 | a few examples | minutes | $0 |
| RAG | $100s of embedding/indexing compute | a document corpus | days | low ongoing |
| LoRA fine-tuning | 1 GPU × hours | 100s–10k examples | hours | $10s–$100s |
| Full SFT | 1–8 GPUs × days | 10k–1M examples | days | $100s–$1,000s |
| RLHF / DPO | 1–8 GPUs × days | preference pairs (+ reward model for RLHF) | weeks | $1,000s |
| Reasoning RL | many GPUs × weeks | task suite + verifier | weeks | $10,000s+ |
| Pretraining | thousands of GPUs × months | trillions of tokens | months | millions+ |
For most teams, the practical fine-tuning options are LoRA SFT (cheap and small) and DPO (preference alignment). Anything bigger requires real ML infrastructure.
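To make "cheap and small" concrete, here is a minimal sketch of a LoRA setup using Hugging Face peft. The model name, rank, and target modules are illustrative assumptions, not recommendations:

```python
# Minimal LoRA setup with Hugging Face peft.
# The model name, rank, and target modules are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # adapter rank: capacity of the low-rank update
    lora_alpha=32,                        # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Training then proceeds as ordinary SFT on your (input, output) examples; only the adapter weights update, which is why a single GPU and a few hours are usually enough.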
Scenarios mapped to techniques
| You want to… | Try first |
|---|---|
| Get a specific JSON shape | Prompt + structured output API |
| Use up-to-date / private docs | RAG |
| Reduce token cost on a high-volume task | LoRA-fine-tune a smaller model |
| Make outputs sound like your brand | LoRA SFT on style examples |
| Improve a chatbot’s helpfulness | Frontier model + system prompt |
| Specialize for a narrow domain (medical, legal) | LoRA SFT on domain data + RAG |
| Train a model to use your tools well | SFT on traces of correct tool use |
| Align with human preferences | DPO on pairwise preferences (example after this table) |
| Improve at math/code reasoning | Reasoning RL (hard, expensive) |
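To make the preference-alignment row concrete: the data DPO needs is just pairs of responses to the same prompt, one preferred and one rejected. The field names below follow the convention used by common DPO trainers (e.g. Hugging Face trl); the content is invented for illustration.

```python
# One DPO training example: a prompt plus a preferred ("chosen") and a
# dispreferred ("rejected") response. Content is invented for illustration.
preference_example = {
    "prompt": "A customer asks for a refund outside the return window. Reply.",
    "chosen": (
        "I'm sorry we can't refund past the 30-day window. Two things we can "
        "do instead: store credit now, or a discount on your next order..."
    ),
    "rejected": "Refunds are only available within 30 days. Nothing we can do.",
}
```

A few hundred to a few thousand of these pairs, fed to a DPO trainer, is the typical preference-alignment setup.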
Cost-benefit math
Quick check:
Per-call savings × monthly call volume × months in production
vs.
Fine-tune cost + infra + maintenance
If you do 100M calls/month at $0.001 saving each, that’s $100k/month. A $5k fine-tune pays for itself instantly.
If you do 100k calls/month at $0.0005 saving each, that’s $50/month. A fine-tune doesn’t pay back in any reasonable timeframe.
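The same check as a few lines of Python. The numbers plugged in are the two cases above, with the $5k fine-tune cost assumed for both:

```python
def months_to_break_even(fine_tune_cost, saving_per_call, calls_per_month,
                         monthly_maintenance=0.0):
    """Rough payback period for a fine-tune, in months."""
    monthly_saving = saving_per_call * calls_per_month - monthly_maintenance
    if monthly_saving <= 0:
        return float("inf")  # never pays back
    return fine_tune_cost / monthly_saving

# High-volume case: 100M calls/month saving $0.001 each -> pays back in days.
print(months_to_break_even(5_000, 0.001, 100_000_000))  # 0.05 months
# Low-volume case: 100k calls/month saving $0.0005 each -> ~8 years.
print(months_to_break_even(5_000, 0.0005, 100_000))     # 100.0 months
```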
When the data isn’t there
Most teams don’t have 10k high-quality examples for fine-tuning. Options:
- Synthetic data: have a strong model generate (input, output) pairs, then filter and verify them carefully (see the sketch after this list).
- Distillation: capture traces of a frontier model on your task, train a smaller model to match.
- Few-shot first: 100 examples used as in-context prompts often beat 100 examples used as fine-tune data.
- Active learning: deploy a baseline; collect data from real use; fine-tune on the hardest cases.
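Here is a minimal sketch of the synthetic-data / distillation route: a strong teacher model generates (input, output) pairs, a filter rejects the weak ones, and the survivors become fine-tune data. The client, teacher model name, seed prompts, and filter are all illustrative assumptions.

```python
# Sketch: generate (input, output) pairs with a strong teacher model, filter,
# and write them out as SFT data. Model name, seeds, and the filter are
# placeholders, not recommendations.
import json
from openai import OpenAI

client = OpenAI()

def generate_pair(seed_prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed teacher model
        messages=[{"role": "user", "content": seed_prompt}],
    )
    return {"input": seed_prompt, "output": resp.choices[0].message.content}

def keep(pair: dict) -> bool:
    # Stand-in for real verification: schema checks, an LLM judge,
    # unit tests for code, or human review.
    return len(pair["output"]) > 50

seed_prompts = [
    "Summarize this clause in plain English: ...",
    "Rewrite this reply in our support team's tone: ...",
]

with open("synthetic_sft.jsonl", "w") as f:
    for seed in seed_prompts:
        pair = generate_pair(seed)
        if keep(pair):
            f.write(json.dumps(pair) + "\n")
```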
Closed vs open
- Closed (OpenAI, Anthropic, Google fine-tuning APIs): easy, expensive, vendor lock-in.
- Open (LLaMA, Qwen, Mistral, etc.): requires GPUs but you own the model.
Closed APIs are great for SFT-style use cases when you don't want to manage infra. Open models are the path when you need serious customization, lower per-call cost, or data privacy.
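For the closed route, the workflow is typically: upload a JSONL file of examples, start a job, wait. A sketch against the OpenAI fine-tuning API; the file name and base model are assumptions, and other vendors' APIs differ in the details.

```python
# Sketch of a hosted fine-tuning job (OpenAI-style API). The training file
# and base model name are assumptions; check the vendor's docs for current
# model identifiers and data format.
from openai import OpenAI

client = OpenAI()

# 1. Upload chat-formatted training examples (JSONL of {"messages": [...]}).
training_file = client.files.create(
    file=open("sft_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Kick off the job against a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed base model
)
print(job.id, job.status)
```

The open-weights equivalent is the LoRA sketch earlier, run on GPUs you manage yourself.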
Common anti-patterns
- “Fine-tuning will fix it” without a clear hypothesis about what’s wrong with the base model.
- Fine-tuning on a task the frontier model already does fine — wasted effort.
- Fine-tuning on garbage data — you get a model fluent in garbage.
- No held-out eval — you can’t tell if you improved or just memorized.
- One epoch when you need three (or three when you need one) — training length needs tuning too.
- Forgetting general capabilities — the model used to chat, now it only outputs JSON. Mix in some general data.
A useful mental model
Pretraining gives the model knowledge. Instruction tuning gives it manners. RLHF/DPO gives it alignment. Your fine-tune gives it a specialty.
If your “specialty” overlaps with what the manners + knowledge already cover, prompting is enough.