Scaling Laws
Bigger model + more data + more compute = lower loss. Predictably. That predictability is what makes foundation-model economics work.
The Kaplan paper (2020)
OpenAI’s “Scaling Laws for Neural Language Models” (Kaplan et al. 2020) showed:
- Loss scales as a power law in model parameters N, dataset tokens D, and compute C.
- Scaling any single factor saturates once the others become the bottleneck.
- Optimal allocation: spend most of your compute on model size.
L(N, D) = A · N^(-α_N) + B · D^(-α_D) + L_∞
For models trained to convergence, the curves are remarkably smooth across orders of magnitude.
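The loss decomposition above can be sketched numerically. Note the constants below (A, B, the exponents, and L_∞) are illustrative placeholders, not the fitted values from Kaplan et al.:

```python
# Sketch of the Kaplan-style loss decomposition L(N, D).
# A, B, alpha_N, alpha_D, L_inf are illustrative, NOT the paper's fitted values.
def kaplan_loss(n_params: float, n_tokens: float,
                A: float = 400.0, alpha_N: float = 0.34,
                B: float = 1000.0, alpha_D: float = 0.28,
                L_inf: float = 1.7) -> float:
    """Predicted loss: two power-law terms plus an irreducible floor."""
    return A * n_params ** -alpha_N + B * n_tokens ** -alpha_D + L_inf

# Growing one factor alone saturates: here the data term comes to dominate.
print(kaplan_loss(1e9, 1e11))   # 1B params, 100B tokens
print(kaplan_loss(1e12, 1e11))  # 1000x the params, same data
```

The second call barely improves on the first: past a point, extra parameters just leave the `B * D^-alpha_D` term as the binding constraint.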
The Chinchilla correction (2022)
DeepMind’s “Training Compute-Optimal Large Language Models” (Hoffmann et al. 2022) reanalyzed scaling and found:
For a given compute budget, you should train a smaller model on more tokens than Kaplan’s analysis suggested.
The compute-optimal rule of thumb: train on ~20 tokens per parameter.
So for a fixed compute budget C ≈ 6 · N · D:
- A 70B model should see ~1.4T tokens.
- A 7B model should see ~140B tokens.
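The allocation above follows from just the two rules in the text: substituting D = 20·N into C ≈ 6·N·D gives C ≈ 120·N², so N = √(C/120). A minimal sketch:

```python
import math

# Chinchilla-style allocation: C ~= 6*N*D with D = 20*N
# implies C = 120*N^2, hence N = sqrt(C/120).
def chinchilla_optimal(compute_flop: float) -> tuple[float, float]:
    """Return (params, tokens) that are compute-optimal at ~20 tokens/param."""
    n_params = math.sqrt(compute_flop / 120.0)
    return n_params, 20.0 * n_params

# The budget that makes a 70B model optimal: 6 * 70e9 * 1.4e12 ~= 5.9e23 FLOP
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"{n:.2e} params, {d:.2e} tokens")  # ~7.0e10 params, ~1.4e12 tokens
```

Running it recovers the 70B / 1.4T pairing from the bullet list.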
GPT-3 (175B trained on ~300B tokens) was massively under-trained by this rule. The Chinchilla 70B model, trained on 1.4T tokens, beat GPT-3 despite being less than half the size.
Beyond Chinchilla: tokens, tokens, tokens
The Chinchilla optimal balances training compute against final loss. But two factors push toward training even smaller models for even more tokens:
- Inference cost. A smaller model is cheaper to serve forever. Even if training is sub-optimal, the lifetime cost favors smaller models on more data.
- Inference latency. Smaller models also respond faster, for the same reason.
LLaMA-2 7B was trained on ~2T tokens (~280 tokens/parameter — way past Chinchilla optimal). The “training compute” was wasted relative to a bigger model — but the resulting 7B is faster, cheaper, and good enough for many uses.
By 2026, the rule of thumb is: start from the compute-optimal allocation, then keep training well past it to amortize inference cost.
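The train-once, serve-forever trade-off can be made concrete with the standard approximations of ~6·N·D FLOP to train and ~2·N FLOP per token served. The deployment volume of 10T served tokens below is an illustrative assumption:

```python
# Lifetime FLOP for a model that is trained once and then served,
# using the standard approximations: ~6*N*D to train, ~2*N per token served.
def lifetime_flop(n_params: float, train_tokens: float, served_tokens: float) -> float:
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

# Over-trained 7B (2T tokens) vs Chinchilla-optimal 70B (1.4T tokens),
# each serving 10T tokens over its lifetime (illustrative volume).
small = lifetime_flop(7e9, 2e12, 10e12)
big = lifetime_flop(70e9, 1.4e12, 10e12)
print(f"7B: {small:.1e} FLOP, 70B: {big:.1e} FLOP")
```

At that serving volume the 7B's lifetime compute is several times lower, which is why "waste" training FLOP on a small model can still be the economical choice.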
What loss curves look like
Plot training loss vs compute on log-log axes. You see:
- Initial fast drop (warm-up phase, learning syntax)
- A clean power-law middle section (the regime predicted by scaling laws)
- Eventually, a plateau as you approach the irreducible loss L_∞
Different models with the same compute end up close to the same loss along the power law. This is why scaling laws are predictive: if you know your compute, you can estimate where you’ll land.
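That kind of estimate is a straight-line fit in log-log space. A sketch on made-up (compute, loss) points, which is enough to show the mechanics of extrapolating to a larger budget:

```python
import math

# Fit loss ~= a * C^(-b) by least squares in log-log space, then extrapolate.
# The (compute, loss) points are made-up illustrative data, not measurements.
points = [(1e20, 3.2), (1e21, 2.7), (1e22, 2.3), (1e23, 1.95)]
xs = [math.log10(c) for c, _ in points]
ys = [math.log10(l) for _, l in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def predict_loss(compute: float) -> float:
    return 10 ** (intercept + slope * math.log10(compute))

# Extrapolate one decade past the largest measured budget.
print(predict_loss(1e24))
```

The negative slope is the power-law exponent; in practice you would fit on the clean middle section of the curve, not the warm-up or the plateau.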
Emergent capabilities
Some tasks show smooth scaling in loss but sharp jumps in benchmark accuracy. Multi-step reasoning, tool use, and instruction following often appear only at specific scales.
This was once described as “emergence.” Later work argues it’s mostly an artifact of:
- Discontinuous metrics (exact match vs. partial credit)
- Multi-step tasks where every step must succeed
Smooth metrics tell a smoother story. But the operational fact remains: a 1B model can’t do what a 70B model does on certain tasks, even with the same architecture.
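The multi-step artifact is easy to simulate: if per-step accuracy improves smoothly with scale, exact-match accuracy on a k-step task is that probability raised to the k-th power, which hugs zero and then jumps. The step count and accuracy values below are illustrative:

```python
# Why a discontinuous metric can look "emergent": per-step accuracy p
# improves smoothly, but exact-match on a k-step task is p**k,
# which stays near zero until p is already high. Values are illustrative.
def exact_match(per_step_acc: float, num_steps: int = 10) -> float:
    return per_step_acc ** num_steps

for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    print(f"per-step {p:.2f} -> 10-step exact match {exact_match(p):.3f}")
```

A model at 50% per-step accuracy scores essentially zero on the 10-step task; at 99% it scores around 90%. The underlying capability curve is smooth, but the benchmark sees a cliff.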
The compute curve
Modern frontier compute scales:
| Year | Frontier compute (training) |
|---|---|
| 2018 (BERT-large) | ~3 × 10²¹ FLOP |
| 2020 (GPT-3) | ~3 × 10²³ FLOP |
| 2023 (GPT-4 estimated) | ~2 × 10²⁵ FLOP |
| 2025 (frontier estimated) | ~10²⁶ FLOP |
Roughly 10× per year through 2020; the pace has since slowed, constrained by data and energy.
Data is the bottleneck
Web-scale text data is ~10–50T tokens depending on quality bar. At Chinchilla optimal of 20 tokens/param, this caps “naive” scaling at ~500B–2.5T parameters.
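The ceiling arithmetic above is a one-liner: divide the token budget by 20 to get the largest compute-optimal model, and C ≈ 6·N·D gives the compute it would absorb.

```python
# Data-ceiling sketch: at ~20 tokens/param, a fixed token budget caps
# the largest compute-optimal model and the compute worth spending on it.
def data_capped_scale(tokens: float) -> tuple[float, float]:
    """Return (max compute-optimal params, corresponding FLOP ~ 6*N*D)."""
    n_params = tokens / 20.0
    return n_params, 6.0 * n_params * tokens

for budget in (10e12, 50e12):  # the 10T and 50T ends of the estimate
    n, c = data_capped_scale(budget)
    print(f"{budget:.0e} tokens -> {n:.1e} params, {c:.1e} FLOP")
```

Even the optimistic 50T-token end caps naive scaling at ~2.5T parameters and a few times 10²⁶ FLOP, right around the frontier figures in the table above.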
To go further, the field is exploring:
- Multi-epoch training (train on the same data multiple times — diminishing returns).
- Synthetic data (LLM-generated, often filtered or distilled).
- Multi-modal data (images, video, audio expand the corpus).
- Code as compute (using code execution for reasoning training).
- Higher-quality filtering (FineWeb, FineWeb-Edu, deduplication tricks).
Inference scaling laws
A second axis of scaling: test-time compute. Reasoning models (o1, o3, Claude with extended thinking, etc.) achieve more by doing more compute per query — generating long chains of thought, then sampling many of them, then aggregating.
The “scaling law” here is roughly:
On hard tasks, a model given repeated doublings of test-time compute can match one trained with far more pretraining compute.
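One sample-and-aggregate recipe is majority voting over many attempts (self-consistency). The sketch below stubs the model call with a random oracle that answers correctly 50% of the time; all the numbers and the `attempt` helper are illustrative, not a real API:

```python
import random
from collections import Counter

# Test-time compute via sampling + aggregation (self-consistency sketch).
# `attempt` is a stand-in for a model call: it returns the right answer
# 50% of the time and one of three wrong answers otherwise.
def attempt(rng: random.Random, correct: str = "42") -> str:
    return correct if rng.random() < 0.5 else rng.choice(["41", "43", "7"])

def majority_vote(num_samples: int, seed: int = 0) -> str:
    rng = random.Random(seed)
    answers = [attempt(rng) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

# A single sample is a coin flip; 64 samples make the vote near-certain.
print(majority_vote(1), majority_vote(64))
```

The per-sample accuracy never changes; spending more compute per query is what turns a 50/50 model into a reliable one on this toy task.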
We unpack this in the Reasoning models article.
Practical implications
For builders:
- Pick the smallest model that can do your task. Inference adds up; training is paid once.
- Don’t underestimate small frontier models. A well-trained 70B model from 2025 outperforms a 175B model from 2020.
- Use scaling laws for back-of-envelope estimates. If a 7B doesn’t improve from 100B → 200B tokens of training, it won’t suddenly improve at 400B.
- For hard tasks, consider reasoning models. Test-time compute is often a better lever than model size.
See also
- Mixture of Experts — increase parameter count without proportional compute
- Reasoning models — test-time scaling
- Frontier architectures