Bigger isn't enough — you also need more data
Slide parameters and training tokens. Watch Chinchilla's loss curve respond. See exactly where GPT-3 went wrong (undertrained) and why LLaMA-3 trains 70B models on 15T tokens.
The Chinchilla formula
Hoffmann et al. (2022) fit a clean three-term loss model: L(N, D) = E + A/N^α + B/D^β. The first term, E, is the irreducible loss: even a perfect model can't predict noisy data exactly. The A/N^α term shrinks as you scale parameters; the B/D^β term shrinks as you scale data. Both matter — and their balance matters even more.
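The formula is easy to play with directly. A minimal sketch, using the approximate fitted constants reported by Hoffmann et al. (2022) — treat the exact values as illustrative, not authoritative:

```python
# Chinchilla loss model: L(N, D) = E + A/N**alpha + B/D**beta.
# Constants are approximate fitted values from Hoffmann et al. (2022).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Chinchilla's own configuration: 70B parameters, 1.4T tokens.
print(f"{chinchilla_loss(70e9, 1.4e12):.3f}")
```

Scaling either axis alone eventually stalls on the other term: with N fixed, no amount of data pushes loss below E + A/N^α.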
Why GPT-3 was the wrong shape
GPT-3 had 175B parameters and 300B training tokens — about 1.7 tokens per parameter. Chinchilla showed that for the same compute, ~70B parameters trained on 1.4T tokens (20× more tokens per parameter) gives lower loss. In Chinchilla's terms, GPT-3 overspent on parameters and underspent on data.
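You can check this claim with the loss model itself. A hedged sketch: using the standard C ≈ 6·N·D estimate of training FLOPs and the approximate Hoffmann et al. constants, reallocate GPT-3's budget at the ~20 tokens/param rule of thumb and compare predicted losses:

```python
# Same-compute comparison under the fitted Chinchilla loss model.
# Constants are approximate values from Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n, d):
    return E + A / n**ALPHA + B / d**BETA

# GPT-3's actual shape: 175B parameters, 300B tokens.
n_gpt3, d_gpt3 = 175e9, 300e9
c = 6 * n_gpt3 * d_gpt3                  # training-compute budget (FLOPs)

# Rebalance the same budget at ~20 tokens/param:
# C = 6 * N * (20 * N)  =>  N = sqrt(C / 120)
n_opt = (c / 120) ** 0.5                 # ~51B parameters
d_opt = c / (6 * n_opt)                  # ~1T tokens

print(f"GPT-3 shape:      {loss(n_gpt3, d_gpt3):.3f}")
print(f"rebalanced shape: {loss(n_opt, d_opt):.3f}")
```

Under this model the rebalanced shape wins at identical compute — the "wrong shape" is a prediction you can read straight off the formula.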
And why modern models go further
LLaMA-3 70B was trained on 15T tokens — over 200 tokens/param, far past Chinchilla-optimal. Why? Because Chinchilla optimizes training compute, but inference cost scales with N alone. If you're going to serve a model billions of times, it's worth paying extra one-time training compute to get a smaller, cheaper-to-serve model. Hence: train smaller, longer.
Anchored to 07-modern-llms/scaling-laws.