step 11 · build
Scale it up: 1M → 10M → 100M params
Same architecture, three sizes. What changes (and what doesn't) when you grow the model. Chinchilla, in practice.
Up to step 09 we trained a 5M-parameter model — small enough to overfit TinyStories on a CPU. Real LLM training operates at scales orders of magnitude bigger: GPT-2 small at 124M, LLaMA-3 8B, Mistral 7B, all the way up to GPT-4-class trillion-parameter models.
This article asks: what actually changes as you scale? Same architecture as step 08 — same Block, same MLP, same multi-head attention. We pick three concrete configs spanning more than an order of magnitude in parameter count, and walk through what shifts.
The good news: the architecture doesn’t change. You wrote it once. The hard part is the training recipe — learning rate, batch size, training tokens, hardware time, wall-clock cost. That’s where Chinchilla and the Scaling Laws Calculator become useful.
Three configs
```python
# tiny_llm/scaling.py
from tiny_llm.gpt import GPTConfig

# What we trained in step 09. Tiny enough for CPU, overfits fast.
TINY = GPTConfig(
    vocab_size=4096, max_seq_len=256,
    d_model=192, n_heads=6, n_layers=6,
)
# ~5M params

# A solid "small" model. Trains in a couple of hours on a single GPU. Generates
# competent TinyStories. The tier where "is this language modeling?" gets a
# convincing yes.
SMALL = GPTConfig(
    vocab_size=4096, max_seq_len=512,
    d_model=384, n_heads=6, n_layers=8,
)
# ~17M params

# A "real" model at the bottom of GPT-2 scale. Useful for checking that
# everything still trains stably; shows you the loss curve people quote
# in scaling-laws plots.
MEDIUM = GPTConfig(
    vocab_size=4096, max_seq_len=1024,
    d_model=768, n_heads=12, n_layers=12,
)
# ~85M params
```
|  | TINY | SMALL | MEDIUM |
|---|---|---|---|
| d_model | 192 | 384 | 768 |
| n_heads | 6 | 6 | 12 |
| n_layers | 6 | 8 | 12 |
| max_seq_len | 256 | 512 | 1024 |
| Params | ~5M | ~17M | ~85M |
| Tokens at “Chinchilla optimal” | ~100M | ~340M | ~1.7B |
| Train time on 1× T4 (Colab) | ~10 min | ~2 hours | ~12 hours |
| Train time on 1× A100 | ~2 min | ~20 min | ~2 hours |
The MEDIUM config is roughly GPT-2 small, just with a smaller vocab. It’s the largest config that’s plausibly trainable in a “single notebook session” with current free-tier compute.
What stays the same
Worth saying explicitly: none of the code in steps 02–09 changes. You don’t add a new module, fork a code path, or write distinct logic for the larger configs. You just hand a different GPTConfig to GPT(...) and re-run.
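To see that concretely, here’s a minimal sketch that instantiates all three configs and counts parameters — assuming the GPT class from step 08 lives in tiny_llm.gpt alongside GPTConfig:

```python
# Quick sanity check on the parameter counts in the table above.
from tiny_llm.gpt import GPT
from tiny_llm.scaling import TINY, SMALL, MEDIUM

for name, cfg in [("TINY", TINY), ("SMALL", SMALL), ("MEDIUM", MEDIUM)]:
    model = GPT(cfg)  # same class every time; only the config changes
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name:>6}: {n_params / 1e6:.1f}M parameters")
```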
That property — that the same architecture spans 5+ orders of magnitude in parameter count — is one of the more remarkable things about the transformer. Every other architecture class people have tried (RNNs, CNNs, MoEs in their original 1990s form) breaks somewhere between “small” and “production.” Transformers don’t.
What changes: the training recipe
Three things scale with the model: the learning rate (down), the batch size (up), and the training tokens (up, roughly proportional to params).
Learning rate scales down
Bigger models need smaller per-step updates. The GPT-3 paper tabulates learning rates that shrink as models grow; a rough rule of thumb in that spirit:
peak_lr ≈ 0.003 / sqrt(d_model)
For our three configs:
|  | d_model | recommended peak LR |
|---|---|---|
| TINY | 192 | ~2e-4 |
| SMALL | 384 | ~1.5e-4 |
| MEDIUM | 768 | ~1e-4 |
The numbers we used in step 09 (lr = 3e-4) are slightly aggressive for TINY and risk instability on MEDIUM. The square-root law isn’t exact but it’s a decent default; the Scaling Laws Calculator lets you see the relationship.
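If you want the rule as code rather than a table, a one-liner does it (the 0.003 constant is this article’s rule of thumb, not a universal law):

```python
import math

def peak_lr(d_model: int) -> float:
    # Rough inverse-square-root rule of thumb; tune around it, don't trust it blindly.
    return 0.003 / math.sqrt(d_model)

for name, d in [("TINY", 192), ("SMALL", 384), ("MEDIUM", 768)]:
    print(f"{name}: d_model={d}, peak_lr ≈ {peak_lr(d):.1e}")
# TINY ≈ 2.2e-04, SMALL ≈ 1.5e-04, MEDIUM ≈ 1.1e-04
```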
Batch size scales up
Bigger batches give better gradient estimates, which help bigger models. Common practice:
|  | typical batch (sequences) | tokens/step |
|---|---|---|
| TINY | 32 | ~8K |
| SMALL | 64 | ~32K |
| MEDIUM | 128 | ~131K |
Batch size is bounded by GPU memory. The trick is gradient accumulation (already in step 09’s TrainConfig): if your machine fits batch size 32 but the model wants batch size 128, set batch_size=32, grad_accum_steps=4 and the optimizer sees a 128-effective batch.
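A minimal, self-contained sketch of the accumulation pattern — a stand-in nn.Linear and random data here; in step 09 the same structure wraps the real model and batch loader:

```python
import torch
import torch.nn as nn

# Stand-in model and data so the pattern runs on its own.
model = nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
micro_batch, grad_accum_steps = 32, 4   # effective batch = 32 × 4 = 128

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        x = torch.randn(micro_batch, 64)
        loss = model(x).pow(2).mean()
        (loss / grad_accum_steps).backward()  # divide so grads average across micro-batches
    optimizer.step()                          # one optimizer step per effective batch
```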
Training tokens scale with params (Chinchilla)
The big result from the Chinchilla paper (DeepMind, 2022) is that the optimal number of training tokens is roughly 20× the parameter count.
|  | params | optimal tokens (~20× rule) |
|---|---|---|
| TINY | 5M | 100M |
| SMALL | 17M | 340M |
| MEDIUM | 85M | 1.7B |
| GPT-3 | 175B | 3.5T |
GPT-3 was famously under-trained at 300B tokens — Chinchilla showed that the same compute spent on a smaller model with more tokens would have produced a better model. LLaMA-3 trained 70B parameters on 15T tokens — way over the Chinchilla ratio, deliberately, because once a model is deployed the cost per inference matters more than the training compute.
Our TinyStories train.bin is ~470M tokens. So:
- TINY’s Chinchilla budget (~100M tokens) is about a fifth of the dataset; a 10,000-step run at ~8K tokens/step (≈80M tokens) doesn’t even finish one epoch
- SMALL can comfortably hit Chinchilla-optimal (~340M tokens) within a single pass
- MEDIUM is undertrained on TinyStories alone — its ~1.7B-token budget is roughly 4 epochs of the data, so you’d want a larger corpus
The Scaling Laws Calculator lets you slide model size and tokens to see the predicted loss; play with it before you commit to a config.
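The same arithmetic as a script, assuming the ~470M-token train.bin figure above:

```python
DATASET_TOKENS = 470e6  # approximate size of our TinyStories train.bin

def chinchilla_tokens(n_params: float) -> float:
    # ~20 training tokens per parameter (Hoffmann et al., 2022)
    return 20 * n_params

for name, n_params in [("TINY", 5e6), ("SMALL", 17e6), ("MEDIUM", 85e6)]:
    budget = chinchilla_tokens(n_params)
    print(f"{name}: optimal ≈ {budget / 1e6:.0f}M tokens "
          f"≈ {budget / DATASET_TOKENS:.1f} passes over train.bin")
# TINY ≈ 0.2 passes, SMALL ≈ 0.7 passes, MEDIUM ≈ 3.6 passes
```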
What doesn’t scale: warmup steps
Worth noting: warmup (the linear LR ramp at the start) stays roughly constant in number of steps, not as a fraction of training. 200–500 warmup steps works well across our three configs.
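For reference, a common warmup-then-cosine schedule in which warmup_steps is an absolute count. Whether step 09’s train() decays exactly like this isn’t shown here; the point is only that the warmup count stays fixed while max_steps grows:

```python
import math

def lr_at(step: int, peak_lr: float, warmup_steps: int, max_steps: int,
          min_lr_ratio: float = 0.1) -> float:
    # Linear warmup to peak_lr, then cosine decay down to min_lr_ratio * peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1 - min_lr_ratio) * cosine)

# Same 300 warmup steps whether the run is 10k steps (TINY) or much longer.
print(lr_at(0, 2e-4, 300, 10_000), lr_at(300, 2e-4, 300, 10_000), lr_at(10_000, 2e-4, 300, 10_000))
```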
Wall-clock and money
Approximate costs on cloud GPUs:
|  | Single A100 (~$1.50/hr cloud) | Single T4 (free tier) |
|---|---|---|
| TINY | $0.05 | 10 min |
| SMALL | $0.50 | 2 hours |
| MEDIUM | $3 | 12 hours (probably timeout) |
This is the “academic” scale — within reach of an individual with a credit card and a weekend. Everything beyond MEDIUM (GPT-2 medium / large, Mistral 7B, LLaMA 8B) lives in serious-cluster territory: 8–512 GPUs in parallel, days of training.
If you want the experience of “I trained a model that’s actually good,” train SMALL on TinyStories for a few hours. The output is genuinely coherent. If you want the experience of “I reproduced a 2018 paper,” train MEDIUM on a bigger corpus and you’ll be at GPT-2-small parity.
A practical scaling experiment
If you have an hour and a GPU (Colab is fine), try this:
```python
# scratch_scale_compare.py
from tiny_llm.gpt import GPTConfig
from tiny_llm.train import train, TrainConfig

# Small but not tiny.
small = GPTConfig(d_model=384, n_heads=6, n_layers=8, max_seq_len=512)

train(
    gpt_cfg=small,
    train_cfg=TrainConfig(
        max_steps=10_000,
        warmup_steps=500,
        lr=1.5e-4,
        batch_size=64,
        eval_interval=500,
        out_dir="checkpoints/small",
    ),
)
```
Two specific things to watch:
- The val loss plateau. TINY plateaus around 1.7 on TinyStories; SMALL should hit ~1.4–1.5. That ~0.2–0.3 nat/token improvement looks small but produces noticeably better text — the model’s perplexity drops from ~5.5 to ~4.4 (the arithmetic is checked in the snippet after this list).
- The sample quality. After training, generate from each model with the same prompt and read both outputs side by side. The TINY output is recognizably “trying”; the SMALL output is fluent. That gap is what scaling buys you.
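A quick check of the loss→perplexity arithmetic from the first point (taking 1.48 as a representative value in the ~1.4–1.5 range):

```python
import math

for name, loss_nats in [("TINY", 1.70), ("SMALL", 1.48)]:
    bits = loss_nats / math.log(2)          # nats → bits per token
    ppl = math.exp(loss_nats)               # perplexity = e^loss (loss in nats)
    print(f"{name}: {loss_nats:.2f} nats/token = {bits:.2f} bits/token, perplexity ≈ {ppl:.1f}")
# TINY:  1.70 nats = 2.45 bits, ppl ≈ 5.5
# SMALL: 1.48 nats = 2.14 bits, ppl ≈ 4.4
```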
Memory math
The thing that decides whether a model fits on your GPU at all is memory, not FLOPs. Rough breakdown for a forward+backward training step:
```
total memory ≈ params × 4 bytes   # parameters (fp32)
             + params × 4 bytes   # gradients (same shape as params)
             + params × 8 bytes   # AdamW state (m, v per param, fp32)
             + activations        # ≈ batch × seq × d × layers × ~10 bytes
```
Per parameter: ~16 bytes during training. For MEDIUM (85M params) that’s ~1.4 GB just for params + grads + optimizer state; activations push it to ~3–4 GB in practice at batch=8 / seq=1024 (the ~10-byte rule above is a floor — attention-score buffers and allocator overhead add more). That fits on a 16 GB V100 or T4; the 128-sequence batch from the table only fits via gradient accumulation.
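The same estimate as a helper function — the activation term uses the ~10-bytes-per-element rule above, so treat the result as a floor rather than what nvidia-smi will report:

```python
def train_memory_gb(n_params: float, batch: int, seq_len: int,
                    d_model: int, n_layers: int) -> float:
    # fp32 weights (4B) + grads (4B) + AdamW m and v (8B), plus rough activations
    param_states = n_params * (4 + 4 + 8)
    activations = batch * seq_len * d_model * n_layers * 10
    return (param_states + activations) / 1e9

# MEDIUM at batch=8: ~2.1 GB by this formula (real usage lands higher)
print(train_memory_gb(85e6, batch=8, seq_len=1024, d_model=768, n_layers=12))
```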
Standard tricks for fitting bigger models on smaller GPUs:
- Mixed-precision training (bf16/fp16): roughly halves activation memory and speeds up matmuls, often close to 2×. Add torch.amp.autocast(...) around the forward pass (see the sketch after this list).
- Gradient accumulation: smaller per-step batch, more steps.
- Gradient checkpointing: recompute activations in the backward pass instead of storing them. Costs roughly one extra forward pass of compute; buys ~5× activation-memory savings.
- Sharded optimizer states (ZeRO): split the optimizer state across GPUs. Worth it past 1B params.
Most of these we won’t need, but they’re the standard scaling-up vocabulary you’ll see in any production training script.
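As a sketch of the first trick, here’s what wrapping the forward pass in autocast looks like with bf16 (assumes a CUDA GPU with bf16 support; the stand-in Linear keeps it self-contained):

```python
import torch

model = torch.nn.Linear(768, 768).cuda()   # stand-in for GPT(MEDIUM)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 768, device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()   # matmuls run in bf16, activations stored in bf16
loss.backward()                     # gradients land on the fp32 weights
optimizer.step()                    # no GradScaler needed with bf16 (unlike fp16)
```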
What we did and didn’t do
What we did:
- Sized three concrete configs spanning ~5M to ~85M params
- Spelled out which hyperparameters scale (LR, batch, tokens) and which don’t (warmup, architecture)
- Pointed at where Chinchilla’s rule lands for each config
- Estimated wall-clock and cost on standard hardware
- Sketched the standard memory-saving tricks
What we didn’t:
- Train multiple sizes ourselves. I don’t want this article to depend on having a GPU; the configs and recipes above are runnable, but the “see them learn” part is opt-in. If you train SMALL, the ~17M-param model should produce noticeably better TinyStories output than TINY.
- Discuss inference scaling. Different topic. Step 14 covers KV cache, which is the dominant inference-time concern.
- Distributed training. Single-machine assumption. Multi-GPU (DDP, FSDP, ZeRO-3) is its own engineering specialization; the architecture and the recipe carry over, but the framework you use changes.
Cross-references
- Scaling Laws Calculator — interactive Chinchilla. Slide model size and tokens, watch predicted loss respond.
- Scaling Laws article — the deeper theory, including isoFLOP curves and why GPT-3 was suboptimally trained.
Next
Step 12 shifts gears: we have a base model that knows how to predict tokens. Now we fine-tune it to follow instructions, using LoRA (low-rank adaptation) — a way to adapt the model with 800× fewer trainable parameters than full fine-tuning. It’s how most consumer fine-tune services work under the hood.