When to Fine-Tune
The most common fine-tuning mistake is fine-tuning when you shouldn’t. The second-most-common is not fine-tuning when you should.
The decision flow
Before you fine-tune, try in this order:
- Better prompting (Stage 08). Few-shot examples, structured output, system prompts.
- RAG (Stage 09). If you need fresh / proprietary information.
- Better model. Sometimes upgrading from a 7B to a frontier model is cheaper than fine-tuning.
- Tool use / agents (Stage 11). For workflows that need actions.
- Then consider fine-tuning.
What fine-tuning is good at
- Style and voice: making the model sound a certain way (a brand persona, a code style, a specific tone).
- Format adherence: producing strict outputs with high reliability.
- Domain idioms: medical/legal/finance jargon and conventions.
- Cost reduction: a fine-tuned 7B can match a frontier model on a narrow task at 1/100th the cost.
- Latency reduction: a fine-tuned smaller model is faster than prompting a larger one.
- Refusal calibration: getting the model to answer or decline appropriately.
- Tool use patterns: training models to call your specific tools well.
What fine-tuning is BAD at
- Adding factual knowledge: fine-tuning is more likely to make the model hallucinate confidently than to reliably learn new facts. Use RAG for facts.
- Improving reasoning: small fine-tunes don’t teach reasoning; they make existing reasoning fit a format. Use a reasoning model.
- Turning mediocre data into magical results: garbage in, garbage out, even at scale.
- One-off behaviors: prompting is way cheaper for “make this output JSON.”
- Frequently changing requirements: fine-tuning is slow; iteration is expensive.
Rule of thumb: if a frontier model with the right prompt and the right context can do it, don’t fine-tune. Fine-tuning is for the persistent gap that prompts can’t close.
Cost & data tradeoffs
Rough numbers (early 2026):
| Approach | Compute | Data | Time | Cost |
|---|---|---|---|---|
| Prompting only | $0 | a few examples | minutes | $0 |
| RAG | $100s of embedding/indexing compute | a document corpus | days | low ongoing |
| LoRA fine-tuning | 1 GPU × hours | 100s–10k examples | hours | $10s–$100s |
| Full SFT | 1–8 GPUs × days | 10k–1M examples | days | $100s–$1,000s |
| RLHF / DPO | 1–8 GPUs × days | preference pairs (+ reward model for RLHF) | weeks | $1,000s |
| Reasoning RL | many GPUs × weeks | task suite + verifier | weeks | $10,000s+ |
| Pretraining | thousands of GPUs × months | trillions of tokens | months | millions+ |
For most teams, the practical fine-tuning options are LoRA SFT (cheap and small) and DPO (preference alignment). Anything bigger requires real ML infrastructure.
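To make "cheap and small" concrete, here is a minimal sketch of a LoRA setup using Hugging Face peft. The model name, rank, and target modules are illustrative assumptions, not recommendations:

```python
# Minimal LoRA setup with Hugging Face peft.
# The model name, rank, and target modules are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_config = LoraConfig(
    r=16,                                 # adapter rank: capacity of the low-rank update
    lora_alpha=32,                        # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Training then proceeds as ordinary SFT on your (input, output) examples; only the adapter weights update, which is why a single GPU and a few hours are usually enough.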
Scenarios mapped to techniques
| You want to… | Try first |
|---|---|
| Get a specific JSON shape | Prompt + structured output API |
| Use up-to-date / private docs | RAG |
| Reduce token cost on a high-volume task | LoRA-fine-tune a smaller model |
| Make outputs sound like your brand | LoRA SFT on style examples |
| Improve a chatbot’s helpfulness | Frontier model + system prompt |
| Specialize for a narrow domain (medical, legal) | LoRA SFT on domain data + RAG |
| Train a model to use your tools well | SFT on traces of correct tool use |
| Align with human preferences | DPO on pairwise preferences (example after this table) |
| Improve at math/code reasoning | Reasoning RL (hard, expensive) |
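To make the preference-alignment row concrete: the data DPO needs is just pairs of responses to the same prompt, one preferred and one rejected. The field names below follow the convention used by common DPO trainers (e.g. Hugging Face trl); the content is invented for illustration.

```python
# One DPO training example: a prompt plus a preferred ("chosen") and a
# dispreferred ("rejected") response. Content is invented for illustration.
preference_example = {
    "prompt": "A customer asks for a refund outside the return window. Reply.",
    "chosen": (
        "I'm sorry we can't refund past the 30-day window. Two things we can "
        "do instead: store credit now, or a discount on your next order..."
    ),
    "rejected": "Refunds are only available within 30 days. Nothing we can do.",
}
```

A few hundred to a few thousand of these pairs, fed to a DPO trainer, is the typical preference-alignment setup.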
Cost-benefit math
Quick check:
Per-call savings × monthly call volume × months in production
vs.
Fine-tune cost + infra + maintenance
If you do 100M calls/month at $0.001 saving each, that’s $100k/month. A $5k fine-tune pays for itself instantly.
If you do 100k calls/month at $0.0005 saving each, that’s $50/month. A fine-tune doesn’t pay back in any reasonable timeframe.
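The same check as a few lines of Python. The numbers plugged in are the two cases above, with the $5k fine-tune cost assumed for both:

```python
def months_to_break_even(fine_tune_cost, saving_per_call, calls_per_month,
                         monthly_maintenance=0.0):
    """Rough payback period for a fine-tune, in months."""
    monthly_saving = saving_per_call * calls_per_month - monthly_maintenance
    if monthly_saving <= 0:
        return float("inf")  # never pays back
    return fine_tune_cost / monthly_saving

# High-volume case: 100M calls/month saving $0.001 each -> pays back in days.
print(months_to_break_even(5_000, 0.001, 100_000_000))  # 0.05 months
# Low-volume case: 100k calls/month saving $0.0005 each -> ~8 years.
print(months_to_break_even(5_000, 0.0005, 100_000))     # 100.0 months
```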
When the data isn’t there
Most teams don’t have 10k high-quality examples for fine-tuning. Options:
- Synthetic data: have a strong model generate (input, output) pairs, then filter and verify them carefully (see the sketch after this list).
- Distillation: capture traces of a frontier model on your task, train a smaller model to match.
- Few-shot first: 100 examples used as in-context prompts often beat 100 examples used as fine-tune data.
- Active learning: deploy a baseline; collect data from real use; fine-tune on the hardest cases.
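Here is a minimal sketch of the synthetic-data / distillation route: a strong teacher model generates (input, output) pairs, a filter rejects the weak ones, and the survivors become fine-tune data. The client, teacher model name, seed prompts, and filter are all illustrative assumptions.

```python
# Sketch: generate (input, output) pairs with a strong teacher model, filter,
# and write them out as SFT data. Model name, seeds, and the filter are
# placeholders, not recommendations.
import json
from openai import OpenAI

client = OpenAI()

def generate_pair(seed_prompt: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed teacher model
        messages=[{"role": "user", "content": seed_prompt}],
    )
    return {"input": seed_prompt, "output": resp.choices[0].message.content}

def keep(pair: dict) -> bool:
    # Stand-in for real verification: schema checks, an LLM judge,
    # unit tests for code, or human review.
    return len(pair["output"]) > 50

seed_prompts = [
    "Summarize this clause in plain English: ...",
    "Rewrite this reply in our support team's tone: ...",
]

with open("synthetic_sft.jsonl", "w") as f:
    for seed in seed_prompts:
        pair = generate_pair(seed)
        if keep(pair):
            f.write(json.dumps(pair) + "\n")
```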
Closed vs open
- Closed (OpenAI, Anthropic, Google fine-tuning APIs): easy, expensive, vendor lock-in.
- Open (LLaMA, Qwen, Mistral, etc.): requires GPUs but you own the model.
Closed APIs are great for SFT-style use cases when you don't want to manage infra. Open models are the path when you need serious customization, lower per-call cost, or data privacy.
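For the closed route, the workflow is typically: upload a JSONL file of examples, start a job, wait. A sketch against the OpenAI fine-tuning API; the file name and base model are assumptions, and other vendors' APIs differ in the details.

```python
# Sketch of a hosted fine-tuning job (OpenAI-style API). The training file
# and base model name are assumptions; check the vendor's docs for current
# model identifiers and data format.
from openai import OpenAI

client = OpenAI()

# 1. Upload chat-formatted training examples (JSONL of {"messages": [...]}).
training_file = client.files.create(
    file=open("sft_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Kick off the job against a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumed base model
)
print(job.id, job.status)
```

The open-weights equivalent is the LoRA sketch earlier, run on GPUs you manage yourself.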
Common anti-patterns
- “Fine-tuning will fix it” without a clear hypothesis about what’s wrong with the base model.
- Fine-tuning on a task the frontier model already does fine — wasted effort.
- Fine-tuning on garbage data — you get a model fluent in garbage.
- No held-out eval — you can’t tell if you improved or just memorized.
- One epoch when you need three (or three when you need one) — training length needs tuning too.
- Forgetting general capabilities — the model used to chat, now it only outputs JSON. Mix in some general data.
A useful mental model
Pretraining gives the model knowledge. Instruction tuning gives it manners. RLHF/DPO gives it alignment. Your fine-tune gives it a specialty.
If your “specialty” overlaps with what the manners + knowledge already cover, prompting is enough.