Scaling Laws

Bigger model + more data + more compute = lower loss. Predictably. That predictability is what makes foundation-model economics work.

The Kaplan paper (2020)

OpenAI’s “Scaling Laws for Neural Language Models” (Kaplan et al. 2020) showed:

  • Loss scales as a power law in model parameters N, dataset tokens D, and compute C.
  • Gains from scaling any one factor saturate if the others are held fixed (the lagging factor becomes the bottleneck).
  • Optimal allocation: spend most of your compute on model size.

The fitted form (A, B, α_N, α_D are fitted constants; L_∞ is the irreducible loss):

L(N, D) = A · N^(-α_N) + B · D^(-α_D) + L_∞

For models trained to convergence, the curves are remarkably smooth across orders of magnitude.
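
To make the formula concrete, here is a minimal Python sketch that evaluates L(N, D) over a few model and data sizes. The constants A, B, α_N, α_D, and L_∞ below are illustrative placeholders of roughly the right magnitude, not fitted values from the paper.

    # Minimal sketch: evaluate the additive scaling-law form
    #   L(N, D) = A * N**(-alpha_N) + B * D**(-alpha_D) + L_inf
    # All constants below are illustrative placeholders, not fitted values.
    A, ALPHA_N = 400.0, 0.34   # model-size term
    B, ALPHA_D = 400.0, 0.28   # data-size term
    L_INF = 1.7                # irreducible loss (the plateau)

    def loss(n_params: float, n_tokens: float) -> float:
        """Predicted loss for n_params parameters trained on n_tokens tokens."""
        return A * n_params ** -ALPHA_N + B * n_tokens ** -ALPHA_D + L_INF

    for n in (1e9, 7e9, 70e9):
        for d in (100e9, 1e12):
            print(f"N={n:.0e}, D={d:.0e}: L = {loss(n, d):.3f}")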

The Chinchilla correction (2022)

DeepMind’s “Training Compute-Optimal Large Language Models” (Hoffmann et al. 2022) reanalyzed scaling and found:

For a given compute budget, you should train a smaller model on more tokens than Kaplan’s analysis suggested.

The compute-optimal rule: train ~20 tokens per parameter.

So for a fixed compute budget C ≈ 6 · N · D:

  • A 70B model should see ~1.4T tokens.
  • A 7B model should see ~140B tokens.

GPT-3 (175B trained on ~300B tokens) was massively under-trained by this rule. The Chinchilla 70B model, trained on 1.4T tokens, beat GPT-3 despite being less than half the size.
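
A quick back-of-envelope sketch of that arithmetic, assuming only the approximations already stated (C ≈ 6 · N · D and ~20 tokens per parameter):

    import math

    def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
        """Compute-optimal split under C ~= 6*N*D and D = 20*N
        (the ~20 tokens-per-parameter rule), so C = 120*N**2."""
        n_params = math.sqrt(compute_flops / 120.0)
        n_tokens = 20.0 * n_params
        return n_params, n_tokens

    # GPT-3-scale budget: 6 * 175e9 params * 300e9 tokens ~= 3.2e23 FLOP
    n, d = chinchilla_optimal(6 * 175e9 * 300e9)
    print(f"Compute-optimal: ~{n / 1e9:.0f}B params on ~{d / 1e12:.2f}T tokens")
    # -> roughly a ~50B model on ~1T tokens, rather than 175B on 0.3T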

Beyond Chinchilla: tokens, tokens, tokens

The Chinchilla-optimal recipe balances training compute against final loss. But two factors push toward training even smaller models on even more tokens:

  1. Inference cost. A smaller model is cheaper to serve forever. Even if training is sub-optimal, the lifetime cost favors smaller models on more data.
  2. Inference latency. Smaller models also respond faster, which matters for interactive use.

LLaMA-2 7B was trained on ~2T tokens (~280 tokens/parameter, far past Chinchilla-optimal). By the Chinchilla analysis, some of that training compute was “wasted” (the same budget on a larger model would have reached lower loss), but the resulting 7B is faster, cheaper, and good enough for many uses.
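
One way to see the trade-off is to count lifetime FLOPs rather than training FLOPs alone. A minimal sketch, using the standard ~6·N FLOPs per training token and ~2·N FLOPs per generated token approximations; the served-token volume is a made-up assumption.

    def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
        """Rough total compute: ~6*N FLOPs per training token plus
        ~2*N FLOPs per token generated at inference (standard approximations)."""
        return 6 * n_params * train_tokens + 2 * n_params * served_tokens

    SERVED = 10e12  # hypothetical: 10T tokens served over the model's lifetime

    small = lifetime_flops(7e9, 2e12, SERVED)     # 7B over-trained on 2T tokens
    large = lifetime_flops(70e9, 1.4e12, SERVED)  # 70B at Chinchilla-optimal
    print(f"7B:  {small:.2e} lifetime FLOPs")
    print(f"70B: {large:.2e} lifetime FLOPs")
    # If both are good enough for the task, the 7B wins by roughly 9x here.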

By 2026, the rule of thumb is: treat Chinchilla-optimal as a floor, and train well past it to amortize inference cost.

What loss curves look like

Plot training loss vs compute on log-log axes. You see:

  • Initial fast drop (warm-up phase, learning syntax)
  • A clean power-law middle section (the regime predicted by scaling laws)
  • Eventually a plateau as you approach L_∞

Different models with the same compute end up close to the same loss along the power law. This is why scaling laws are predictive: if you know your compute, you can estimate where you’ll land.
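
In practice that estimate is a straight-line fit on log-log axes through the power-law region, extrapolated to your budget. A minimal numpy sketch; the data points are made up for illustration, and it ignores L_∞ (fine only while you are far from the plateau).

    import numpy as np

    # Hypothetical (compute, loss) points from small pilot runs.
    compute = np.array([1e19, 1e20, 1e21, 1e22])
    loss = np.array([3.10, 2.72, 2.41, 2.15])

    # In the power-law regime, log(loss) is roughly linear in log(compute)
    # (ignoring L_inf, which matters once you approach the plateau).
    slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

    def predict_loss(c: float) -> float:
        return 10 ** (intercept + slope * np.log10(c))

    print(f"fitted exponent: {slope:.3f}")
    print(f"predicted loss at 1e23 FLOP: {predict_loss(1e23):.2f}")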

Emergent capabilities

Some tasks show smooth scaling in loss but sharp jumps in benchmark accuracy. Multi-step reasoning, tool use, and instruction following often emerge at specific scales.

This was once described as “emergence.” Later work argues it’s mostly an artifact of:

  • Discontinuous metrics (exact match vs. partial credit)
  • Multi-step tasks where every step must succeed

Smooth metrics tell a smoother story. But the operational fact remains: a 1B model can’t do what a 70B model does on certain tasks, even with the same architecture.
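
The multi-step point is just arithmetic: if every one of k steps must succeed, end-to-end accuracy is p^k, so a smooth rise in per-step accuracy looks like a sharp jump on the task metric. A toy illustration:

    # Smooth per-step accuracy vs. sharp end-to-end accuracy on a k-step task.
    steps = 10
    for per_step in (0.80, 0.90, 0.95, 0.99):
        end_to_end = per_step ** steps
        print(f"per-step {per_step:.2f} -> {steps}-step success {end_to_end:.1%}")
    # 0.80 -> 10.7%, 0.90 -> 34.9%, 0.95 -> 59.9%, 0.99 -> 90.4%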

The compute curve

Modern frontier compute scales:

Year                        Frontier compute (training)
2018 (BERT-large)           ~3 × 10²¹ FLOP
2020 (GPT-3)                ~3 × 10²³ FLOP
2023 (GPT-4 estimated)      ~2 × 10²⁵ FLOP
2025 (frontier estimated)   ~10²⁶ FLOP

That averages out to roughly 4–5× per year, and the pace is starting to slow due to data and energy constraints.
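
The implied annual growth rates fall straight out of the (rough) figures in the table above, so treat the output as rough too:

    # Annualized growth factors implied by the figures above.
    points = [(2018, 3e21), (2020, 3e23), (2023, 2e25), (2025, 1e26)]
    for (y0, c0), (y1, c1) in zip(points, points[1:]):
        per_year = (c1 / c0) ** (1 / (y1 - y0))
        print(f"{y0}->{y1}: ~{per_year:.1f}x per year")
    # 2018->2020: ~10.0x, 2020->2023: ~4.1x, 2023->2025: ~2.2x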

Data is the bottleneck

Web-scale text data amounts to ~10–50T tokens, depending on the quality bar. At the Chinchilla-optimal 20 tokens per parameter, that caps “naive” scaling at ~500B–2.5T parameters.

To go further, the field is exploring:

  • Multi-epoch training (train on the same data multiple times — diminishing returns).
  • Synthetic data (LLM-generated, often filtered or distilled).
  • Multi-modal data (images, video, audio expand the corpus).
  • Code as compute (using code execution for reasoning training).
  • Higher-quality filtering (FineWeb, FineWeb-Edu, deduplication tricks).

Inference scaling laws

A second axis of scaling: test-time compute. Reasoning models (o1, o3, Claude with extended thinking, etc.) improve by spending more compute per query: generating long chains of thought, sampling many of them, and aggregating the results.

The “scaling law” here is roughly:

Spending more compute at test time on a hard task can match the accuracy of a model trained with much more pretraining compute.
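
As a concrete shape of “sample many chains of thought, then aggregate”: a minimal self-consistency-style sketch. The generate_answer function is a hypothetical stand-in for your model call; n_samples is the test-time-compute dial.

    from collections import Counter

    def generate_answer(question: str) -> str:
        """Hypothetical stand-in: one sampled chain of thought -> final answer."""
        raise NotImplementedError("plug in your model call here")

    def self_consistency(question: str, n_samples: int) -> str:
        """Sample n_samples independent chains of thought and return the
        majority final answer. More samples = more compute per query."""
        answers = [generate_answer(question) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]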

We unpack this in the Reasoning models article.

Practical implications

For builders:

  1. Pick the smallest model that can do your task. Inference adds up; training is paid once.
  2. Don’t underestimate small frontier models. A 70B model from 2025 outperforms a 175B model from 2020.
  3. Use scaling laws for back-of-envelope estimates. If a 7B doesn’t improve from 100B → 200B tokens of training, it won’t suddenly improve at 400B.
  4. For hard tasks, consider reasoning models. Test-time compute is often a better lever than model size.

See also