Stage 07 — Modern LLMs

The transformer of 2026 is the 2017 skeleton with a lot of new muscle. This stage covers what changed: scaling laws, sparse activation (MoE), reasoning models, long context, and the broader architectural landscape.

Prerequisites

  • Stage 06 (transformers)

Learning ladder

  1. Scaling laws — Chinchilla, compute-optimal training
  2. Mixture of Experts — sparse activation, routing
  3. Reasoning models — o-series, R1, test-time compute
  4. Long context — 1M+ tokens, attention variants, retrieval interplay
  5. Frontier architectures — what 2026 frontier models look like

MVU

You can:

  • State the Chinchilla rule of thumb (≈20 tokens per parameter) and explain why it matters
  • Describe how MoE changes parameter count vs active parameters
  • Distinguish a reasoning model from a non-reasoning model and explain when to use each
  • Pick a model size and architecture for a given budget and use case
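The ≈20-tokens-per-parameter rule can be turned into a back-of-envelope budget calculator. A minimal sketch, assuming the standard approximations C ≈ 6·N·D for training FLOPs and D ≈ 20·N for compute-optimal tokens; the `chinchilla_optimal` helper name and the rounded constants are illustrative, not exact fitted values:

```python
# Back-of-envelope Chinchilla budgeting.
# Assumes C ≈ 6*N*D (training FLOPs) and D ≈ 20*N (compute-optimal tokens);
# both constants are rounded rules of thumb, not the fitted paper values.

def chinchilla_optimal(n_params: float) -> dict:
    """Given a parameter count N, return the rule-of-thumb token budget
    and the implied training-FLOPs estimate."""
    tokens = 20 * n_params          # ≈20 tokens per parameter
    flops = 6 * n_params * tokens   # C ≈ 6 N D
    return {"params": n_params, "tokens": tokens, "train_flops": flops}

# Under this rule, a 70B-parameter model "wants" roughly 1.4T tokens.
budget = chinchilla_optimal(70e9)
print(f"tokens: {budget['tokens']:.2e}")       # ≈1.40e+12
print(f"FLOPs:  {budget['train_flops']:.2e}")  # ≈5.88e+23
```

The point of the exercise: for a fixed compute budget, a smaller model trained on more tokens usually beats a larger model trained on fewer.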

Exercise

Run the same hard prompt (e.g. a multi-step math word problem) through three models:

  • A small fast model (e.g. Haiku 4.5)
  • A frontier non-reasoning model (e.g. Sonnet 4.6)
  • A frontier reasoning model (e.g. an o-series model or Claude with extended thinking)

Compare:

  • Output quality
  • Latency
  • Cost
  • Confidence on simple sub-questions

Field reports — real-world case studies

Hands-on companions

After the theory, three concrete next stops:

Watch it interactively:

  • Scaling Laws — drag log10(N) and log10(D); watch loss, FLOPs, $, and tokens/param all re-tween live with the real Chinchilla constants.
  • Reasoning Models — discrete reasoning_effort slider (none/low/medium/high) showing real-feeling reasoning traces, token counts, latency, and the predicted answer flip on the bat-and-ball problem.
  • MoE Routing — real router math: word → 16-d hashed embedding → seeded W_router (16×8) → softmax → top-K. The router-math panel shows the live computation.
  • Long Context — attention mask topology under sliding-window, strided, global-local patterns. The cost math is on the page.
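The router math on the MoE page (word → hashed embedding → seeded router matrix → softmax → top-K) is small enough to reproduce offline. A toy sketch with the same 16-d / 8-expert shapes; the hashing trick, weight initialization, and K=2 here are illustrative stand-ins, not the page's exact constants:

```python
# Toy top-K expert routing: hash a token to a 16-d pseudo-embedding,
# multiply by a seeded 16x8 router matrix, softmax over 8 experts,
# keep the top-K gates. Dimensions and init are illustrative choices.
import hashlib
import math
import random

N_DIM, N_EXPERTS, TOP_K = 16, 8, 2

def hashed_embedding(word: str) -> list[float]:
    # Deterministic pseudo-embedding: hash bytes mapped into [-1, 1].
    digest = hashlib.sha256(word.encode()).digest()
    return [b / 127.5 - 1.0 for b in digest[:N_DIM]]

rng = random.Random(0)  # seeded router weights, as on the page
W_router = [[rng.gauss(0.0, 0.5) for _ in range(N_EXPERTS)]
            for _ in range(N_DIM)]

def route(word: str) -> list[tuple[int, float]]:
    x = hashed_embedding(word)
    logits = [sum(x[i] * W_router[i][e] for i in range(N_DIM))
              for e in range(N_EXPERTS)]
    m = max(logits)                       # stable softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the TOP_K experts by gate probability.
    ranked = sorted(enumerate(probs), key=lambda t: -t[1])
    return ranked[:TOP_K]

print(route("transformer"))  # [(expert_id, gate_prob), ...]
```

This is also the cleanest way to see the parameter-count story: all 8 experts' weights count toward total parameters, but only the top-K experts' weights are active per token.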

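The long-context cost math is simple enough to check by hand: full attention touches n² query-key pairs per layer, while a sliding window of width w touches about n·w. A back-of-envelope sketch (function names are mine; constants and the FFN are ignored):

```python
# Attention score-matrix entries only: full attention is n^2,
# a sliding window of width w is roughly n*w. Illustrative only.

def attn_pairs_full(n: int) -> int:
    return n * n

def attn_pairs_sliding(n: int, w: int) -> int:
    # Each query attends to at most w positions.
    return n * min(w, n)

n, w = 1_000_000, 4096   # 1M-token context, 4k sliding window
print(attn_pairs_full(n) / attn_pairs_sliding(n, w))  # ≈244x fewer pairs
```

This ratio is why 1M+ token contexts lean on windowed, strided, or global-local patterns rather than dense attention everywhere.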
Ship the stack:

See also