tiny-llm · 11 / 16 · 22 min read · 30 min hands-on

step 11 · build

Scale it up: 1M → 10M → 100M params

Same architecture, three sizes. What changes (and what doesn't) when you grow the model. Chinchilla, in practice.

training

Up to step 09 we trained a 5M-parameter model — small enough to overfit TinyStories on a CPU. Real LLM training operates orders of magnitude bigger: GPT-2 small at 124M, LLaMA-3 8B, Mistral 7B, all the way up to GPT-4-class trillion-parameter models.

This article asks: what actually changes as you scale? Same architecture as step 08 — same Block, same MLP, same multi-head attention. We pick three concrete configs that span two orders of magnitude in parameter count, and walk through what shifts.

The good news: the architecture doesn’t change. You wrote it once. The hard part is the training recipe: learning rate, batch size, training tokens, wall-clock time, and dollar cost. That’s where Chinchilla and the Scaling Laws Calculator become useful.

Three configs

# tiny_llm/scaling.py
from tiny_llm.gpt import GPTConfig

# What we trained in step 09. Tiny enough for CPU, overfits fast.
TINY = GPTConfig(
    vocab_size=4096, max_seq_len=256,
    d_model=192, n_heads=6, n_layers=6,
)
# ~5M params

# A solid "small" model. Trains in a few hours on a single GPU. Generates
# competent TinyStories; the first size where it clearly feels like language modeling.
SMALL = GPTConfig(
    vocab_size=4096, max_seq_len=512,
    d_model=384, n_heads=6, n_layers=8,
)
# ~17M params

# A "real" model at the bottom of GPT-2 scale. Useful for checking that
# everything still trains stably; shows you the loss curve people quote
# in scaling-laws plots.
MEDIUM = GPTConfig(
    vocab_size=4096, max_seq_len=1024,
    d_model=768, n_heads=12, n_layers=12,
)
# ~85M params
                                 TINY       SMALL      MEDIUM
d_model                          192        384        768
n_heads                          6          6          12
n_layers                         6          8          12
max_seq_len                      256        512        1024
Params                           ~5M        ~17M       ~85M
Tokens at “Chinchilla optimal”   ~100M      ~340M      ~1.7B
Train time on 1× T4 (Colab)      ~10 min    ~2 hours   ~12 hours
Train time on 1× A100            ~2 min     ~20 min    ~2 hours

The MEDIUM config is roughly GPT-2 small, just with a smaller vocab. It’s the largest config that’s plausibly trainable in a “single notebook session” with current free-tier compute.
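
If you want to sanity-check those parameter counts before instantiating anything, a back-of-envelope formula gets close: each block contributes roughly 12·d_model² weights (the attention projections plus the 4×-wide MLP), and the embeddings add vocab_size·d_model each for the input embedding and the output head, plus max_seq_len·d_model for learned positions. A minimal sketch under those assumptions (exact counts depend on weight tying, biases, and layernorm parameters in your implementation):

def approx_params(d_model: int, n_layers: int, vocab_size: int, max_seq_len: int) -> int:
    """Back-of-envelope parameter count; ignores biases and layernorms."""
    blocks = 12 * d_model**2 * n_layers       # attention + 4x MLP per layer
    embeddings = 2 * vocab_size * d_model     # input embedding + untied output head
    positions = max_seq_len * d_model         # learned positional embeddings
    return blocks + embeddings + positions

for name, (d, layers, seq) in {"TINY": (192, 6, 256),
                               "SMALL": (384, 8, 512),
                               "MEDIUM": (768, 12, 1024)}.items():
    print(f"{name}: ~{approx_params(d, layers, 4096, seq) / 1e6:.1f}M")
# TINY ~4.3M, SMALL ~17.5M, MEDIUM ~92.0M: same ballpark as the table above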

What stays the same

Worth saying explicitly: none of the code in steps 02–09 changes. You don’t add a new module, fork a code path, or write distinct logic for the larger configs. You just hand a different GPTConfig to GPT(...) and re-run.
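
Concretely, the whole "scaling step" looks like this (assuming the GPT class lives next to GPTConfig in tiny_llm/gpt.py, which is how the training script uses it):

from tiny_llm.gpt import GPT                # assumed to live alongside GPTConfig
from tiny_llm.scaling import TINY, SMALL, MEDIUM

for name, cfg in [("TINY", TINY), ("SMALL", SMALL), ("MEDIUM", MEDIUM)]:
    model = GPT(cfg)                        # identical code path for all three sizes
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")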

That property — that the same architecture spans 5+ orders of magnitude in parameter count — is one of the more remarkable things about the transformer. Every other architecture class people have tried (RNNs, CNNs, mixture-of-experts in its original 1990s form) has broken down somewhere between “small” and “production.” Transformers don’t.

What changes: the training recipe

Three things scale with the model: the learning rate (down), the batch size (up), and the training tokens (up, roughly proportional to params).

Learning rate scales down

Bigger models need smaller per-step updates. The rough rule from the GPT-3 paper:

peak_lr ≈ 0.003 / sqrt(d_model)

For our three configs:

          d_model   recommended peak LR
TINY      192       ~2e-4
SMALL     384       ~1.5e-4
MEDIUM    768       ~1e-4

The numbers we used in step 09 (lr = 3e-4) are slightly aggressive for TINY and would diverge on MEDIUM. The square-root law isn’t exact but it’s a decent default; the Scaling Laws Calculator lets you see the relationship.
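
As a quick check on that table, the square-root rule is one line of code (a sketch of the rule as stated above, not a substitute for a proper LR sweep):

import math

def peak_lr(d_model: int, base: float = 0.003) -> float:
    """Rule-of-thumb peak learning rate: ~0.003 / sqrt(d_model)."""
    return base / math.sqrt(d_model)

for d in (192, 384, 768):
    print(f"d_model={d}: peak_lr ≈ {peak_lr(d):.1e}")
# d_model=192: ~2.2e-04, d_model=384: ~1.5e-04, d_model=768: ~1.1e-04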

Batch size scales up

Bigger batches give better gradient estimates, which help bigger models. Common practice:

          typical batch (sequences)   tokens/step
TINY      32                          ~8K
SMALL     64                          ~32K
MEDIUM    128                         ~131K

Batch size is bounded by GPU memory. The trick is gradient accumulation (already in step 09’s TrainConfig): if your machine fits batch size 32 but the model wants batch size 128, set batch_size=32, grad_accum_steps=4 and the optimizer sees a 128-effective batch.
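
The pattern, as a self-contained toy (an nn.Linear stands in for the GPT model and random tensors for TinyStories batches; the real loop is step 09's train function):

import torch
import torch.nn as nn

model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

micro_batch, grad_accum_steps = 32, 4         # optimizer sees an effective batch of 128

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        x, y = torch.randn(micro_batch, 16), torch.randn(micro_batch, 16)
        loss = nn.functional.mse_loss(model(x), y)
        (loss / grad_accum_steps).backward()  # scale so accumulated grads average, not sum
    optimizer.step()                          # one update per effective batch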

Training tokens scale with params (Chinchilla)

The big result from the Chinchilla paper (DeepMind, 2022) is that the optimal number of training tokens is roughly 20× the parameter count.

          params   optimal tokens (~20× rule)
TINY      5M       100M
SMALL     17M      340M
MEDIUM    85M      1.7B
GPT-3     175B     3.5T

GPT-3 was famously under-trained at 300B tokens — Chinchilla showed that the same compute spent on a smaller model with more tokens would have produced a better model. LLaMA-3 trained 70B parameters on 15T tokens — way over the Chinchilla ratio, deliberately, because once a model is deployed the cost per inference matters more than the training compute.

Our TinyStories train.bin is ~470M tokens. So:

  • TINY’s Chinchilla budget (~100M tokens) fits inside the dataset 4–5 times over; a 10,000-step run at ~8K tokens/step only sees ~80M tokens, well under one pass
  • SMALL can comfortably train Chinchilla-optimal in one pass
  • MEDIUM is undertrained on TinyStories alone — you’d want a larger corpus
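
A few lines make those bullets concrete, using the ~20× ratio and the tokens-per-step figures from the tables above (the 470M-token count is our train.bin):

DATASET_TOKENS = 470e6                       # TinyStories train.bin, from above

def chinchilla_tokens(n_params: float, ratio: float = 20.0) -> float:
    """Chinchilla rule of thumb: optimal training tokens ≈ 20 × parameter count."""
    return ratio * n_params

for name, params, tokens_per_step in [("TINY",   5e6,  8_192),
                                      ("SMALL",  17e6, 32_768),
                                      ("MEDIUM", 85e6, 131_072)]:
    optimal = chinchilla_tokens(params)
    print(f"{name}: {optimal / 1e6:.0f}M tokens "
          f"≈ {optimal / tokens_per_step:,.0f} steps "
          f"≈ {optimal / DATASET_TOKENS:.1f} passes over TinyStories")
# TINY: 100M tokens ≈ 12,207 steps ≈ 0.2 passes
# SMALL: 340M tokens ≈ 10,376 steps ≈ 0.7 passes
# MEDIUM: 1700M tokens ≈ 12,970 steps ≈ 3.6 passes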

The Scaling Laws Calculator lets you slide model size and tokens to see the predicted loss; play with it before you commit to a config.

What doesn’t scale: warmup steps

Worth noting: warmup (the linear LR ramp at the start) stays roughly constant in number of steps, not as a fraction of training. 200–500 warmup steps works well across our three configs.
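
In code, warmup is just the first branch of the LR schedule. A minimal sketch, assuming the common linear-warmup-then-cosine-decay shape (the warmup half is what the paragraph above pins down; the cosine decay is a conventional choice, not something this series has specified):

import math

def lr_at(step: int, peak_lr: float, warmup_steps: int = 500,
          max_steps: int = 10_000, min_lr_frac: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr_frac * peak_lr."""
    if step < warmup_steps:
        # Warmup is a fixed number of steps, regardless of model size.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_frac + (1 - min_lr_frac) * cosine)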

Wall-clock and money

Approximate costs on cloud GPUs:

          Single A100 (~$1.50/hr cloud)   Single T4 (free tier)
TINY      $0.05                           10 min
SMALL     $0.50                           2 hours
MEDIUM    $3                              12 hours (probably timeout)

This is the “academic” scale — within reach of an individual with a credit card and a weekend. Everything beyond MEDIUM (GPT-2 medium / large, Mistral 7B, LLaMA 8B) lives in serious-cluster territory: 8–512 GPUs in parallel, days of training.

If you want the experience of “I trained a model that’s actually good,” train SMALL on TinyStories for a few hours. The output is genuinely coherent. If you want the experience of “I reproduced a 2018 paper,” train MEDIUM on a bigger corpus and you’ll be at GPT-2-small parity.

A practical scaling experiment

If you have a GPU and an hour or two (Colab is fine), try this:

# scratch_scale_compare.py
from tiny_llm.gpt import GPTConfig
from tiny_llm.train import train, TrainConfig

# Small but not tiny.
small = GPTConfig(vocab_size=4096, d_model=384, n_heads=6, n_layers=8, max_seq_len=512)  # same as SMALL above

train(
    gpt_cfg=small,
    train_cfg=TrainConfig(
        max_steps=10_000,
        warmup_steps=500,
        lr=1.5e-4,
        batch_size=64,
        eval_interval=500,
        out_dir="checkpoints/small",
    ),
)

Two specific things to watch:

  1. The val loss plateau. TINY plateaus around 1.7 on TinyStories; SMALL should hit ~1.4–1.5. That ~0.2–0.3 nat/token improvement looks small but produces noticeably better text: the model’s perplexity drops from ~5.5 to ~4.4.

  2. The sample quality. After training, generate from each model with the same prompt and read both outputs side by side. The TINY output is recognizably “trying”; the SMALL output is fluent. That gap is what scaling buys you.

Memory math

The thing that decides whether a model fits on your GPU at all is memory, not FLOPs. Rough breakdown for a forward+backward training step:

total memory ≈ params × 4 bytes      # parameters (fp32)
             + params × 4 bytes      # gradients (same shape as params)
             + params × 8 bytes      # AdamW state (m, v per param, fp32)
             + activations           # ~ batch × seq × d × layers × ~10 bytes

Per parameter: ~16 bytes during training. For MEDIUM (85M params): ~1.4 GB just for params + grads + optimizer state. Activations push it to ~3–4 GB at batch=8 / seq=1024. That fits comfortably on a 16 GB card (V100 or T4) and gets tight on 8–12 GB consumer cards.
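
The same arithmetic as a function, so you can plug in your own config before renting a GPU. Treat it as a floor, not a ceiling: attention score matrices, temporary buffers, and the CUDA context sit on top of it, which is why the ~3–4 GB figure above is higher than the raw estimate.

def training_memory_gb(n_params: float, batch: int, seq_len: int,
                       d_model: int, n_layers: int,
                       act_bytes: float = 10.0) -> float:
    """Rough fp32 AdamW training footprint, following the breakdown above."""
    params_grads_opt = n_params * (4 + 4 + 8)                        # bytes
    activations = batch * seq_len * d_model * n_layers * act_bytes  # bytes
    return (params_grads_opt + activations) / 1e9

# MEDIUM at batch=8, seq=1024: ~2.1 GB before attention buffers and overhead
print(f"{training_memory_gb(85e6, batch=8, seq_len=1024, d_model=768, n_layers=12):.1f} GB")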

Standard tricks for fitting bigger models on smaller GPUs:

  • Mixed-precision training (bf16/fp16): run the forward pass in half precision for roughly half the activation memory and a ~2× speedup. Add torch.amp.autocast(...) around the forward pass; a minimal sketch follows this list.
  • Gradient accumulation: smaller micro-batches, several forward/backward passes per optimizer step (as above).
  • Gradient checkpointing: recompute activations in the backward pass instead of storing them. Costs roughly one extra forward pass of compute in exchange for a several-fold cut in activation memory.
  • Sharded optimizer states (ZeRO): split the optimizer state across GPUs. Worth it past ~1B params.
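
Of those four, mixed precision is the one-liner. A minimal sketch of the autocast pattern, again with a toy model standing in for the GPT (bf16 needs no loss scaling, which is part of why it's the usual choice on GPUs that support it):

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 16).to(device)              # stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 16, device=device), torch.randn(32, 16, device=device)

optimizer.zero_grad(set_to_none=True)
with torch.amp.autocast(device_type=device, dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), y)    # matmuls inside run in bf16
loss.backward()                                   # params and grads stay fp32
optimizer.step()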

Most of these we won’t need, but they’re the standard scaling-up vocabulary you’ll see in any production training script.

What we did and didn’t do

What we did:

  • Sized three concrete configs spanning ~5M to ~85M params
  • Spelled out which hyperparameters scale (LR, batch, tokens) and which don’t (warmup, architecture)
  • Pointed at where Chinchilla’s rule lands for each config
  • Estimated wall-clock and cost on standard hardware
  • Sketched the standard memory-saving tricks

What we didn’t:

  • Train multiple sizes ourselves. I don’t want this article to depend on having a GPU; the configs and recipes above are runnable, but the “see them learn” part is opt-in. If you train SMALL, the ~17M-param model should produce noticeably better TinyStories output than TINY.
  • Discuss inference scaling. Different topic. Step 14 covers KV cache, which is the dominant inference-time concern.
  • Distributed training. Single-machine assumption. Multi-GPU training (DDP, FSDP, ZeRO-3) is its own engineering specialization; the architecture and the recipe carry over, but the framework you run them in changes.

Cross-references

  • Scaling Laws Calculator — interactive Chinchilla. Slide model size and tokens, watch predicted loss respond.
  • Scaling Laws article — the deeper theory, including isoFLOP curves and why GPT-3 was suboptimally trained.

Next

Step 12 shifts gears: we have a base model that knows how to predict tokens. Now we fine-tune it to follow instructions, using LoRA (low-rank adaptation), a way to adapt the model with ~800× fewer trainable parameters than full fine-tuning. It’s how most consumer fine-tuning services work under the hood.