demo

Bigger isn't enough — you also need more data

Slide parameters and training tokens. Watch Chinchilla's loss curve respond. See exactly where GPT-3 went wrong (undertrained) and why LLaMA-3 trains 70B models on 15T tokens.

The Chinchilla formula

Hoffmann et al. (2022) fit a clean three-term loss model: L(N, D) = E + A/N^α + B/D^β. The first term E is irreducible: even a perfect model can't go below the noise in the data itself. The second term shrinks as you scale parameters N; the third shrinks as you scale data D. Both matter, and their balance matters even more.
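
To make that concrete, here is a minimal Python sketch of the loss model. The constants are the approximate values fitted in the Chinchilla paper (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); treat them as illustrative, and the function and variable names are ours, not the paper's.

```python
# Minimal sketch of the Chinchilla loss model, assuming the approximate
# constants fitted by Hoffmann et al. (2022).
E, A, B = 1.69, 406.4, 410.7      # irreducible loss and scale coefficients
ALPHA, BETA = 0.34, 0.28          # exponents for parameters and data

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

print(chinchilla_loss(175e9, 300e9))   # GPT-3-like split   -> ~2.00
print(chinchilla_loss(70e9, 1.4e12))   # Chinchilla split   -> ~1.94
```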

Why GPT-3 was the wrong shape

GPT-3 had 175B parameters and 300B training tokens, about 1.7 tokens per parameter. Chinchilla showed that a similar compute budget spent on ~70B parameters and 1.4T tokens (about 20 tokens per parameter, more than 10× GPT-3's ratio) gives lower loss. GPT-3 overspent on parameters and underspent on data.
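
A rough way to see the gap, under the same fitted constants, the usual C ≈ 6·N·D FLOP approximation, and the ~20 tokens-per-parameter rule of thumb: spend GPT-3's budget on a compute-optimal split instead and the modeled loss drops, even though the model shrinks to roughly 50B parameters. The helper names and rule-of-thumb split are our assumptions, not the paper's exact procedure.

```python
import math

# Approximate Chinchilla fit (Hoffmann et al., 2022); names are ours.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def optimal_split(flops, tokens_per_param=20.0):
    # C ~ 6*N*D with D = k*N  =>  N = sqrt(C / (6*k)), D = k*N
    n = math.sqrt(flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

gpt3_flops = 6.0 * 175e9 * 300e9          # ~3.1e23 training FLOPs
n_opt, d_opt = optimal_split(gpt3_flops)  # roughly 50B params, ~1T tokens

print(f"GPT-3 split      : L ~ {loss(175e9, 300e9):.3f}")
print(f"same-FLOP optimal: L ~ {loss(n_opt, d_opt):.3f} "
      f"({n_opt/1e9:.0f}B params, {d_opt/1e12:.1f}T tokens)")
```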

And why modern models go further

LLaMA-3 70B was trained on 15T tokens: over 200 tokens per parameter, far past Chinchilla-optimal. Why? Because Chinchilla optimizes training compute only, while inference cost scales with N alone. If you're going to serve a model billions of times, it's worth paying extra one-time training compute to get a smaller, cheaper-to-serve model. Hence: train smaller, longer.
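
A back-of-envelope sketch of the serving argument, using the common approximations of ~6·N·D training FLOPs and ~2·N inference FLOPs per generated token (these approximations and the specific numbers are our assumptions, not part of the demo): the same training budget, spent Chinchilla-style, buys a model that is roughly 3× more expensive to serve on every token, forever.

```python
import math

# Back-of-envelope serving math. Assumptions (ours, not from the demo):
# training cost ~ 6*N*D FLOPs, inference cost ~ 2*N FLOPs per generated token.
N_small, D_small = 70e9, 15e12                 # LLaMA-3-70B-style over-training
train_budget = 6 * N_small * D_small           # ~6.3e24 FLOPs

# The same budget spent "Chinchilla-style" at ~20 tokens per parameter
# buys a much bigger model...
N_opt = math.sqrt(train_budget / (6 * 20))     # ~230B parameters
print(f"same-budget Chinchilla-style size: ~{N_opt/1e9:.0f}B params")

# ...which then costs proportionally more FLOPs for every token you ever serve.
print(f"per-served-token cost vs the 70B model: ~{N_opt / N_small:.1f}x")
```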

Anchored to 07-modern-llms/scaling-laws.