Stage 07 — Modern LLMs
The transformer of 2026 is the 2017 skeleton with a lot of new muscle. This stage covers what changed: scaling laws, sparse activation (MoE), reasoning models, long context, and the broader architectural landscape.
Prerequisites
- Stage 06 (transformers)
Learning ladder
- Scaling laws — Chinchilla, compute-optimal training
- Mixture of Experts — sparse activation, routing
- Reasoning models — o-series, R1, test-time compute
- Long context — 1M+ tokens, attention variants, retrieval interplay
- Frontier architectures — what 2026 frontier models look like
MVU
You can:
- State the Chinchilla rule of thumb (≈20 tokens per parameter) and explain why it matters (see the sizing sketch after this list)
- Describe how MoE changes parameter count vs active parameters (worked through in the MoE routing sketch under Hands-on companions)
- Distinguish a reasoning model from a non-reasoning model and explain when to use each
- Pick a model size and architecture for a given budget and use case
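For the sizing item, a minimal back-of-envelope sketch, assuming the ≈20 tokens/parameter rule of thumb and the standard C ≈ 6·N·D estimate for training FLOPs; the model sizes are illustrative, not recommendations:

```python
def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal training tokens: ~20 per parameter (Chinchilla rule)."""
    return 20.0 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

# Illustrative model sizes: 1B, 7B, 70B parameters
for n in (1e9, 7e9, 70e9):
    d = chinchilla_tokens(n)
    print(f"N={n:.0e} params -> D={d:.0e} tokens, C~{train_flops(n, d):.1e} FLOPs")
```

Why the rule matters: for a fixed FLOPs budget, a smaller model trained on more tokens generally beats a larger undertrained one.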
Exercise
Run the same hard prompt (e.g. a multi-step math word problem) through three models (a minimal harness sketch follows the lists below):
- A small fast model (e.g. Haiku 4.5)
- A frontier non-reasoning model (e.g. Sonnet 4.6)
- A frontier reasoning model (e.g. an o-series model or Claude with extended thinking)
Compare:
- Output quality
- Latency
- Cost
- Confidence on simple sub-questions
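A minimal harness sketch for the exercise, assuming the Anthropic Python SDK; the model IDs and per-million-token prices are placeholders to replace with current values from the provider's docs:

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = ("A bat and a ball cost $1.10 together. The bat costs $1.00 "
          "more than the ball. How much does the ball cost?")

# Placeholder model IDs and ($/M input, $/M output) prices; substitute
# current values from the provider's docs.
MODELS = {
    "small-fast":         ("SMALL_FAST_MODEL_ID", 1.00, 5.00),
    "frontier":           ("FRONTIER_MODEL_ID", 3.00, 15.00),
    "frontier-reasoning": ("REASONING_MODEL_ID", 3.00, 15.00),
}

for label, (model_id, in_price, out_price) in MODELS.items():
    t0 = time.perf_counter()
    resp = client.messages.create(
        model=model_id,
        max_tokens=1024,
        # For the reasoning model, also enable extended thinking per the SDK docs.
        messages=[{"role": "user", "content": PROMPT}],
    )
    latency = time.perf_counter() - t0
    cost = (resp.usage.input_tokens * in_price
            + resp.usage.output_tokens * out_price) / 1e6
    print(f"{label}: {latency:.2f}s  ${cost:.5f}")
    print(resp.content[0].text, "\n")
```

Judging confidence on simple sub-questions stays manual: read each transcript and note where a model hedges or flips on steps it should find trivial.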
Field reports — real-world case studies
- Field report: DeepSeek-R1 — pure-RL reasoning training, the “aha moment,” and the published distillation recipe. The real-world worked example for reasoning-models.md. Anchored to arxiv:2501.12948.
Hands-on companions
After the theory, three concrete next stops:
Watch it interactively:
- Scaling Laws — drag log10(N) and log10(D); watch loss, FLOPs, $, tokens/param all update live with the real Chinchilla constants.
- Reasoning Models — discrete reasoning_effort slider (none/low/medium/high) showing real-feeling reasoning traces, token counts, latency, and the predicted answer flip on the bat-and-ball problem.
- MoE Routing — real router math: word → 16-d hashed embedding → seeded W_router (16×8) → softmax → top-K. The router-math panel shows the live computation (a code sketch follows this list).
- Long Context — attention mask topology under sliding-window, strided, global-local patterns. The cost math is on the page (a mask sketch follows this list).
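A minimal sketch of the MoE Routing page's pipeline, assuming its toy setup (16-d hashed embedding, seeded 16×8 router, top-2 routing); the hashing scheme and seeds here are illustrative stand-ins for the page's exact values:

```python
import hashlib
import numpy as np

D_MODEL, N_EXPERTS, TOP_K = 16, 8, 2

def hashed_embedding(word: str) -> np.ndarray:
    # Deterministic per-word seed; a stand-in for the page's hashing scheme
    seed = int(hashlib.md5(word.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(D_MODEL)

W_router = np.random.default_rng(0).standard_normal((D_MODEL, N_EXPERTS))

def route(word: str):
    x = hashed_embedding(word)             # (16,) token embedding
    logits = x @ W_router                  # (8,) one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over experts
    top = np.argsort(probs)[::-1][:TOP_K]  # top-K expert indices
    return top, probs[top]

experts, weights = route("transformer")
print(experts, weights)  # the two chosen experts and their routing weights
```

This is also where the total-vs-active parameter distinction from the MVU list lives: with top-2 of 8 experts, only a quarter of the expert FFN parameters run per token, so a model's headline parameter count can be several times its active count.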
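And a toy sketch of the Long Context mask topology, with illustrative sizes (n = 1024, window = 128); the attended-pair count stands in for the page's cost math:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Full causal attention: token i attends to every j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """Sliding-window attention: token i attends only to the last `window` keys."""
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False  # drop keys outside the window
    return m

n, w = 1024, 128
full, sliding = causal_mask(n), sliding_window_mask(n, w)
# Attended pairs track the cost of the score matrix: O(n^2) vs O(n*w)
print(full.sum(), sliding.sum())  # 524800 vs 122944
```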
Ship the stack:
- /ship/01 — pick a model — Llama vs Qwen vs Mistral vs Phi; the production decision matrix.
- /ship/14 — cost and latency — quantization, prompt cache, speculative decoding; the levers that turn an “idea” into “shippable cost.”