Stage 07 — Modern LLMs
The transformer of 2026 is the 2017 skeleton with a lot of new muscle. This stage covers what changed: scaling laws, sparse activation (MoE), reasoning models, long context, and the broader architectural landscape.
Prerequisites
- Stage 06 (transformers)
Learning ladder
- Scaling laws — Chinchilla, compute-optimal training
- Mixture of Experts — sparse activation, routing
- Reasoning models — o-series, R1, test-time compute
- Long context — 1M+ tokens, attention variants, retrieval interplay
- Frontier architectures — what 2026 frontier models look like
MVU
You can:
- State the Chinchilla rule of thumb (≈20 tokens per parameter) and explain why it matters (see the sizing sketch after this list)
- Describe how MoE changes parameter count vs active parameters (worked through in the MoE routing sketch under Hands-on companions)
- Distinguish a reasoning model from a non-reasoning model and explain when to use each
- Pick a model size and architecture for a given budget and use case
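For the sizing item, a minimal back-of-envelope sketch, assuming the ≈20 tokens/parameter rule of thumb and the standard C ≈ 6·N·D estimate for training FLOPs; the model sizes are illustrative, not recommendations:

```python
def chinchilla_tokens(n_params: float) -> float:
    """Compute-optimal training tokens: ~20 per parameter (Chinchilla rule)."""
    return 20.0 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard estimate: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

# Illustrative model sizes: 1B, 7B, 70B parameters
for n in (1e9, 7e9, 70e9):
    d = chinchilla_tokens(n)
    print(f"N={n:.0e} params -> D={d:.0e} tokens, C~{train_flops(n, d):.1e} FLOPs")
```

Why the rule matters: for a fixed FLOPs budget, a smaller model trained on more tokens generally beats a larger undertrained one.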
Exercise
Run the same hard prompt (e.g. a multi-step math word problem) through three models (a minimal harness sketch follows the lists below):
- A small fast model (e.g. Haiku 4.5)
- A frontier non-reasoning model (e.g. Sonnet 4.6)
- A frontier reasoning model (e.g. an o-series model or Claude with extended thinking)
Compare:
- Output quality
- Latency
- Cost
- Confidence on simple sub-questions
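A minimal harness sketch for the exercise, assuming the Anthropic Python SDK; the model IDs and per-million-token prices are placeholders to replace with current values from the provider's docs:

```python
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = ("A bat and a ball cost $1.10 together. The bat costs $1.00 "
          "more than the ball. How much does the ball cost?")

# Placeholder model IDs and ($/M input, $/M output) prices; substitute
# current values from the provider's docs.
MODELS = {
    "small-fast":         ("SMALL_FAST_MODEL_ID", 1.00, 5.00),
    "frontier":           ("FRONTIER_MODEL_ID", 3.00, 15.00),
    "frontier-reasoning": ("REASONING_MODEL_ID", 3.00, 15.00),
}

for label, (model_id, in_price, out_price) in MODELS.items():
    t0 = time.perf_counter()
    resp = client.messages.create(
        model=model_id,
        max_tokens=1024,
        # For the reasoning model, also enable extended thinking per the SDK docs.
        messages=[{"role": "user", "content": PROMPT}],
    )
    latency = time.perf_counter() - t0
    cost = (resp.usage.input_tokens * in_price
            + resp.usage.output_tokens * out_price) / 1e6
    print(f"{label}: {latency:.2f}s  ${cost:.5f}")
    print(resp.content[0].text, "\n")
```

Judging confidence on simple sub-questions stays manual: read each transcript and note where a model hedges or flips on steps it should find trivial.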
Field reports — real-world case studies
- Field report: DeepSeek-R1 — pure-RL reasoning training, the “aha moment,” and the published distillation recipe. The real-world worked example for reasoning-models.md. Anchored to arxiv:2501.12948.
Hands-on companions
After the theory, three concrete next stops:
Watch it interactively:
- Scaling Laws — drag log10(N) and log10(D); watch loss, FLOPs, $, tokens/param all update live with the real Chinchilla constants.
- Reasoning Models — discrete reasoning_effort slider (none/low/medium/high) showing real-feeling reasoning traces, token counts, latency, and the predicted answer flip on the bat-and-ball problem.
- MoE Routing — real router math: word → 16-d hashed embedding → seeded W_router (16×8) → softmax → top-K. The router-math panel shows the live computation (a code sketch follows this list).
- Long Context — attention mask topology under sliding-window, strided, global-local patterns. The cost math is on the page (a mask sketch follows this list).
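A minimal sketch of the MoE Routing page's pipeline, assuming its toy setup (16-d hashed embedding, seeded 16×8 router, top-2 routing); the hashing scheme and seeds here are illustrative stand-ins for the page's exact values:

```python
import hashlib
import numpy as np

D_MODEL, N_EXPERTS, TOP_K = 16, 8, 2

def hashed_embedding(word: str) -> np.ndarray:
    # Deterministic per-word seed; a stand-in for the page's hashing scheme
    seed = int(hashlib.md5(word.encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(D_MODEL)

W_router = np.random.default_rng(0).standard_normal((D_MODEL, N_EXPERTS))

def route(word: str):
    x = hashed_embedding(word)             # (16,) token embedding
    logits = x @ W_router                  # (8,) one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over experts
    top = np.argsort(probs)[::-1][:TOP_K]  # top-K expert indices
    return top, probs[top]

experts, weights = route("transformer")
print(experts, weights)  # the two chosen experts and their routing weights
```

This is also where the total-vs-active parameter distinction from the MVU list lives: with top-2 of 8 experts, only a quarter of the expert FFN parameters run per token, so a model's headline parameter count can be several times its active count.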
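And a toy sketch of the Long Context mask topology, with illustrative sizes (n = 1024, window = 128); the attended-pair count stands in for the page's cost math:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Full causal attention: token i attends to every j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """Sliding-window attention: token i attends only to the last `window` keys."""
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False  # drop keys outside the window
    return m

n, w = 1024, 128
full, sliding = causal_mask(n), sliding_window_mask(n, w)
# Attended pairs track the cost of the score matrix: O(n^2) vs O(n*w)
print(full.sum(), sliding.sum())  # 524800 vs 122944
```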
Ship the stack:
- /ship/01 — pick a model — Llama vs Qwen vs Mistral vs Phi; the production decision matrix.
- /ship/14 — cost and latency — quantization, prompt cache, speculative decoding; the levers that turn an “idea” into “shippable cost.”