Data & Tooling
The hardest part of fine-tuning is the data, not the training. The second hardest is operating the tooling. Here’s a practical map.
Tooling landscape (early 2026)
TRL (HuggingFace)
The reference implementation. Supports SFT, DPO, KTO, ORPO, GRPO. Works with PEFT (LoRA), bitsandbytes (QLoRA), DeepSpeed, FSDP.
from trl import SFTTrainer, SFTConfig
Pros: official, current, integrates with HuggingFace ecosystem. Cons: configuration sprawl; some options need source diving.
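A minimal SFT run looks roughly like this (model name, dataset path, and hyperparameters are placeholders, not recommendations; assumes a chat-format JSONL dataset):

# Sketch: minimal TRL SFT run on a chat-format JSONL dataset.
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("json", data_files="./my_data.jsonl", split="train")

config = SFTConfig(
    output_dir="./sft-out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_steps=10,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    args=config,
    train_dataset=dataset,
)
trainer.train()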
Axolotl
Config-driven YAML training pipeline. Wraps TRL/Accelerate/DeepSpeed.
base_model: meta-llama/Llama-3.1-8B-Instruct
adapter: lora
lora_r: 16
datasets:
  - path: ./my_data.jsonl
    type: alpaca
Pros: minimal Python, reproducible configs, well-tuned defaults for many models. Cons: less flexible than raw TRL.
Default for many open-source fine-tuning projects.
Unsloth
Optimized SFT/DPO kernels; advertises roughly 2× faster training and substantially lower memory use than stock TRL on consumer GPUs.
Pros: fastest single-GPU training in the ecosystem; gentle learning curve. Cons: focused on consumer hardware; some advanced features take longer to land.
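The usual entry point is Unsloth’s FastLanguageModel wrapper; a sketch (model name and LoRA settings are placeholders, and exact arguments vary by Unsloth version):

# Sketch: Unsloth QLoRA setup; the resulting model/tokenizer plug into TRL's SFTTrainer.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",  # placeholder 4-bit base
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)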
LlamaFactory
Web UI + CLI for fine-tuning. Good for teams who want a point-and-click training experience without writing code.
Megatron-LM, NeMo
For multi-node, multi-GPU full-scale training. Used for serious pretraining and large fine-tunes.
Cloud platforms
- Modal, RunPod, Lambda, Together AI — rent GPUs for training.
- AWS SageMaker, GCP Vertex AI, Azure ML — full ML platforms.
- Anyscale, MosaicML (Databricks) — managed distributed training.
For a one-off fine-tune, Modal or RunPod is simplest. For scale, the bigger platforms.
Closed-source fine-tuning APIs
- OpenAI fine-tuning (GPT-4o, GPT-4o-mini): SFT plus preference (DPO-style) and reinforcement fine-tuning on some models; easy, expensive.
- Anthropic fine-tuning (Claude): limited availability.
- Together AI fine-tuning: open-source models, cheap, simple.
- Fireworks, Anyscale, Replicate: similar.
For prototyping, closed-source is fastest. For ongoing use, open-source typically wins on cost.
Data formats
The de facto standards:
Chat / conversations (most common)
{
  "messages": [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."}
  ]
}
JSONL, one record per line.
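To sanity-check that records render correctly through your base model’s chat template (tokenizer name is a placeholder):

# Render one JSONL record through the chat template and eyeball the result.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder

with open("my_data.jsonl") as f:
    record = json.loads(f.readline())

print(tokenizer.apply_chat_template(record["messages"], tokenize=False))
# Should show the role markers / special tokens you expect for this model family.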
Alpaca format (legacy but still common)
{
  "instruction": "Translate to French",
  "input": "Hello",
  "output": "Bonjour"
}
DPO / preference
{
  "prompt": "...",
  "chosen": "...",
  "rejected": "..."
}
Embedding training
{"query": "...", "positive": "...", "negative": "..."}
Building good fine-tuning data
Sources
- Real user data: best signal, but requires a deployed product.
- Distilled from a strong model: cheap, scalable; check for licensing.
- Synthetic from your specs: have an LLM generate examples covering your distribution (sketch after this list).
- Existing public datasets: seed mixture (open-instruct, ShareGPT, Tulu, OpenOrca, etc.) for general capability retention.
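A bare-bones synthetic generation loop might look like the following (client, model name, and prompt are illustrative; any strong model works, and the output still needs dedup and review):

# Sketch: generate synthetic chat examples from a spec, to be reviewed before training.
import json
from openai import OpenAI  # illustrative provider; any strong model works

client = OpenAI()
SPEC = (
    "Write one user question about <your domain> and an ideal assistant answer, "
    'as JSON: {"user": "...", "assistant": "..."}'
)  # placeholder spec

with open("synthetic.jsonl", "w") as f:
    for _ in range(100):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": SPEC}],
        )
        pair = json.loads(resp.choices[0].message.content)  # assumes the model returns valid JSON
        record = {"messages": [
            {"role": "user", "content": pair["user"]},
            {"role": "assistant", "content": pair["assistant"]},
        ]}
        f.write(json.dumps(record) + "\n")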
Quality > quantity
A curated 1k beats a noisy 100k. Specific things to do (a cleaning sketch follows the list):
- Hand-review the first 50–100 examples before training.
- Filter out short/empty/obviously broken records.
- Deduplicate near-duplicates.
- Balance difficulty and topic.
- Verify formatting matches the model’s chat template.
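A minimal cleaning pass covering the filter and dedup steps above (thresholds are arbitrary placeholders; the dedup key only catches whitespace-normalized exact duplicates, not true near-duplicates):

# Drop short/broken records and normalized exact duplicates from a chat-format JSONL file.
import hashlib
import json

seen, kept = set(), []
with open("my_data.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        msgs = rec.get("messages", [])
        text = " ".join(m.get("content", "") for m in msgs)
        if len(msgs) < 2 or len(text) < 50:   # missing turns or too short
            continue
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key in seen:                        # duplicate after normalization
            continue
        seen.add(key)
        kept.append(rec)

with open("my_data.clean.jsonl", "w") as f:
    for rec in kept:
        f.write(json.dumps(rec) + "\n")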
Mix general data
If you’re worried about catastrophic forgetting, mix in 5–20% of general instruction data (e.g. open-instruct-3M). Helps the model retain non-target capabilities.
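With HuggingFace datasets, one way to build such a mixture (the ratio and the general dataset are placeholders; both sets need compatible columns, e.g. a shared messages field):

# Interleave ~10% general instruction data with the task-specific set.
from datasets import load_dataset, interleave_datasets

task = load_dataset("json", data_files="my_data.clean.jsonl", split="train")
general = load_dataset("allenai/tulu-v2-sft-mixture", split="train")  # placeholder general mix
general = general.select_columns(["messages"])  # keep only the column shared with the task set

mixed = interleave_datasets(
    [task, general],
    probabilities=[0.9, 0.1],  # ~10% general data
    seed=42,
)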
Compute estimation
Quick rule of thumb for SFT/LoRA:
training FLOPs ≈ 6 × params × tokens trained
GPU-hours ≈ training FLOPs / (GPU TFLOPs × 10^12 × utilization × 3600)
For a 7B model, 10k examples × 1k tokens × 3 epochs ≈ 30M tokens trained. On an A100 (~300 TFLOPs peak, assuming ~50% utilization):
hours ≈ (6 × 7e9 × 3e7) / (3e14 × 0.5 × 3600) ≈ 2.3 hours
So a small SFT run costs a few GPU-hours, not days. LoRA is roughly 1.5× faster than full fine-tuning (cheaper backward pass, no optimizer states for the frozen weights).
For 70B with QLoRA on a single 80GB GPU, expect roughly 10–30× longer than the 7B run on the same data (10× the parameters, plus quantization overhead and smaller batches).
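The same estimate as a throwaway calculation (peak TFLOPs and utilization are assumptions to adjust for your hardware):

# Back-of-envelope GPU-hours for SFT, using FLOPs ≈ 6 × params × tokens.
def sft_gpu_hours(params, tokens, tflops_peak=300, utilization=0.5):
    flops = 6 * params * tokens
    return flops / (tflops_peak * 1e12 * utilization * 3600)

print(sft_gpu_hours(7e9, 30e6))   # ≈ 2.3 GPU-hours on one A100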
Distributed training
When one GPU isn’t enough:
- DDP (Distributed Data Parallel): replicate model on each GPU, shard data. Default for small/medium models.
- FSDP (Fully Sharded Data Parallel): shard parameters, gradients, optimizer state across GPUs. PyTorch native; the modern default for large models.
- DeepSpeed ZeRO: similar to FSDP, with more knobs (ZeRO-1, -2, -3 progressively shard more).
- Tensor / pipeline parallel: split a single model across GPUs. For models that don’t fit on one GPU (e.g. 70B+ in fp16).
For most teams, FSDP + LoRA handles up to 70B-class models on 4–8 GPUs.
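With the HuggingFace stack, FSDP is mostly a launch-and-config concern; a rough sketch of the trainer side, assuming the script is launched across GPUs with accelerate launch or torchrun (model, data, and flags are placeholders):

# Sketch: FSDP + LoRA via TRL; real runs usually pin an accelerate FSDP config file.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

dataset = load_dataset("json", data_files="./my_data.jsonl", split="train")

config = SFTConfig(
    output_dir="./sft-fsdp",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    fsdp="full_shard auto_wrap",  # shard params, grads, optimizer state across GPUs
)
trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder; needs several GPUs
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()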
Monitoring training
What to log:
- Loss curves (train and validation).
- Learning rate schedule.
- Gradient norms (catch instability).
- Per-token loss distribution (find which examples are hardest).
- GPU utilization (catch dataloader bottlenecks).
- Memory (avoid OOM).
Tools: Weights & Biases, MLflow, TensorBoard, Aim. Weights & Biases is the de facto standard for fine-tuning.
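With the HuggingFace trainer stack, most of this reduces to turning on reporting (project name and intervals are placeholders):

# Log train/val loss, LR, grad norm, and throughput to Weights & Biases via the trainer.
import os
from trl import SFTConfig

os.environ["WANDB_PROJECT"] = "my-finetune"  # placeholder project name

config = SFTConfig(
    output_dir="./sft-out",
    report_to="wandb",      # also accepts "tensorboard", "mlflow", ...
    logging_steps=10,
    eval_strategy="steps",  # log validation loss during training (needs an eval dataset)
    eval_steps=200,
    max_grad_norm=1.0,
)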
Debugging a stuck training
If loss isn’t dropping:
- Check the chat template. Wrong template = model never learns.
- Check loss masking. If you’re computing loss on prompt tokens, the learning signal gets diluted (see the probe after this list).
- Check learning rate. Too low = slow; too high = diverge.
- Check the data. A glance at 5 examples often catches a formatting issue.
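For the first two checks it helps to look at exactly what the trainer sees; a rough probe, assuming a collator that marks masked positions with -100 in the labels:

# Inspect one training batch: which tokens actually receive loss?
# (trainer and tokenizer are the objects from your training script.)
batch = next(iter(trainer.get_train_dataloader()))
labels = batch["labels"][0]
supervised = labels != -100

print("supervised fraction:", supervised.float().mean().item())
print("supervised text:", tokenizer.decode(batch["input_ids"][0][supervised]))
# ~1.0 means prompts are probably not masked; ~0.0 means everything is masked.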
If loss drops fast then stalls:
- Likely overfitting on a small dataset.
- Lower the learning rate, add data, or stop earlier.
If loss drops then explodes:
- Gradient instability. Add clipping (max_grad_norm=1.0).
- Reduce the learning rate.
- Check for any NaN-producing examples.
Saving and serving
After training:
- Checkpoints: save the adapter (small) or a merged model (big); see the merge sketch below.
- Tokenizer: save alongside the model — critical for inference.
- Config: chat template, special tokens, generation defaults.
- Eval results: for the README of the model artifact.
- Versioning: keep prior versions; you might need to roll back.
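Merging a LoRA adapter into the base weights is a few lines with PEFT (paths and model name are placeholders):

# Merge a LoRA adapter into the base model and save a standalone checkpoint.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder base
model = PeftModel.from_pretrained(base, "./sft-out/checkpoint-500")              # placeholder adapter path
merged = model.merge_and_unload()

merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./merged-model")  # ship the tokenizer too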
Serving:
- vLLM: highest-performance OSS inference; supports LoRA serving.
- SGLang: similar; good for structured generation.
- TGI (HuggingFace): production-ready inference server.
- llama.cpp: CPU/edge serving, GGUF format.
- Together / Fireworks / Anyscale: managed inference for your fine-tuned model.
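A quick smoke test of the merged checkpoint with vLLM’s offline API (for adapter-only checkpoints, vLLM can instead load LoRA weights at request time):

# Generate from the merged model to verify it loads and behaves sanely.
from vllm import LLM, SamplingParams

llm = LLM(model="./merged-model")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Translate to French: Hello"], params)
print(outputs[0].outputs[0].text)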
Reproducibility
Fine-tuning runs vary across hardware. To minimize:
- Set seeds (torch.manual_seed, np.random.seed).
- Pin library versions in requirements.txt.
- Save the exact data version (hash the JSONL).
- Save training configs alongside checkpoints.
- Log everything to W&B or similar.
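The seed and data-versioning steps in code (a sketch; bitwise determinism across different hardware is still not guaranteed):

# Pin randomness and record exactly which data file the run used.
import hashlib
import random

import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

with open("my_data.clean.jsonl", "rb") as f:
    data_hash = hashlib.sha256(f.read()).hexdigest()
print("data sha256:", data_hash)  # log this alongside the training config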
Common operational pitfalls
- OOM mid-training: usually fixable with gradient accumulation or smaller seq length.
- Slow data loading: increase dataloader workers; file format matters (use Arrow/Parquet for large datasets).
- Tokenizer/template mismatch: vocabularies and chat templates differ across model families; use the ones that match your base model.
- Disk full: big checkpoints + WandB artifacts. Plan storage.
- No backup of training data: if you lose it, you lose reproducibility.
See also
- Supervised fine-tuning (SFT)
- LoRA & QLoRA
- RLHF, DPO, GRPO
- Stage 13 — Production — serving the result