07

stage · curriculum

Modern LLMs

What changed since 2017: scaling laws, mixture of experts, reasoning models, long context. Pick a model size, architecture, and tier for a given budget and use case — without this, you'll over- or under-buy.

6 articles
28 min to read
4 demos
4 books
if you only do one thing

Chinchilla's tokens-per-parameter rule is the most important non-architectural insight in modern ML. Drag the sliders; watch loss, FLOPs, and dollars retween in real time.
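
If you want the same numbers outside the demo, here is a minimal sketch of the math it animates, assuming the approximate parametric loss fit published by Hoffmann et al. (2022) and the standard C ≈ 6·N·D training-FLOPs rule; the dollar figure uses a placeholder price per FLOP, not any provider's real rate.

```python
# The math the demo animates: Chinchilla parametric loss fit (Hoffmann et al.,
# 2022, approximate published constants) plus the standard C ~ 6*N*D FLOPs rule.
E, A, B = 1.69, 406.4, 410.7        # irreducible loss + fit coefficients
ALPHA, BETA = 0.34, 0.28            # exponents for params (N) and tokens (D)

def chinchilla(n_params, n_tokens, usd_per_flop=2e-18):
    """Return (loss, training FLOPs, rough $ cost, tokens per parameter).

    usd_per_flop is a placeholder price, not any provider's real rate.
    """
    loss = E + A / n_params**ALPHA + B / n_tokens**BETA
    flops = 6 * n_params * n_tokens                 # C ~ 6*N*D approximation
    return loss, flops, flops * usd_per_flop, n_tokens / n_params

# 70B params on 1.4T tokens: the compute-optimal ~20 tokens/param point.
loss, flops, dollars, tpp = chinchilla(70e9, 1.4e12)
print(f"loss~{loss:.2f}  FLOPs~{flops:.2e}  cost~${dollars:,.0f}  tokens/param~{tpp:.0f}")
```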

Articles in this stage

  1. 01 Field report: DeepSeek-R1 — reasoning from pure RL, in the open
  2. 02 Frontier Architectures
  3. 03 Long Context
  4. 04 Mixture of Experts (MoE)
  5. 05 Reasoning Models
  6. 06 Scaling Laws

Stage 07 — Modern LLMs

The transformer of 2017 vs the transformer of 2026 is the same skeleton with a lot of new muscle. This stage covers what changed: scaling laws, sparse activation (MoE), reasoning models, long context, and the broader architectural landscape.

Prerequisites

  • Stage 06 (transformers)

Learning ladder

  1. Scaling laws — Chinchilla, compute-optimal training
  2. Mixture of Experts — sparse activation, routing
  3. Reasoning models — o-series, R1, test-time compute
  4. Long context — 1M+ tokens, attention variants, retrieval interplay (cost sketch after this list)
  5. Frontier architectures — what 2026 frontier models look like
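
The cost sketch referenced in item 4: full causal attention scores every query against every earlier key, a sliding window only against the last w keys. The window size and sequence length below are illustrative, and the count is query-key pairs per layer, ignoring constant factors.

```python
# Why long context is expensive: full causal attention is O(L^2) in
# query-key pairs, a sliding window is O(L*w). Numbers here are illustrative.
def attention_pairs(seq_len, window=None):
    """Query-key pairs one causal attention layer must score."""
    if window is None:                              # full causal attention
        return seq_len * (seq_len + 1) // 2
    return sum(min(i + 1, window) for i in range(seq_len))

L, W = 1_000_000, 4_096                             # 1M-token context, 4k window
full, sliding = attention_pairs(L), attention_pairs(L, W)
print(f"full:    {full:.3e} pairs")
print(f"sliding: {sliding:.3e} pairs  ({full / sliding:,.0f}x fewer)")
```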

MVU

You can:

  • State the Chinchilla rule of thumb (≈20 tokens per parameter) and explain why it matters
  • Describe how MoE changes parameter count vs active parameters (worked numbers after this list)
  • Distinguish a reasoning model from a non-reasoning model and explain when to use each
  • Pick a model size and architecture for a given budget and use case
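
The worked numbers referenced above, using rough Mixtral-8x7B-style figures; the per-component sizes are approximations for illustration, not the exact published breakdown.

```python
# MoE parameter accounting with rough Mixtral-8x7B-style numbers: every token
# uses the shared weights plus only top_k of n_experts FFN experts, so the
# per-token ("active") parameter count is far below the stored total.
shared_params     = 2.0e9     # attention + embeddings + norms (rough)
params_per_expert = 5.6e9     # one FFN expert (rough)
n_experts, top_k  = 8, 2      # Mixtral-style routing: 2 of 8 experts per token

total_params  = shared_params + n_experts * params_per_expert   # memory cost
active_params = shared_params + top_k * params_per_expert       # per-token compute

print(f"total:  {total_params / 1e9:.1f}B params (what you store and serve)")
print(f"active: {active_params / 1e9:.1f}B params (what each token pays for)")
```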

Exercise

Run the same hard prompt (e.g. a multi-step math word problem) through three models (a minimal timing harness is sketched after the comparison list):

  • A small fast model (e.g. Haiku 4.5)
  • A frontier non-reasoning model (e.g. Sonnet 4.6)
  • A frontier reasoning model (e.g. an o-series model or Claude with extended thinking)

Compare:

  • Output quality
  • Latency
  • Cost
  • Confidence on simple sub-questions
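
The timing harness referenced above, as one possible sketch: it assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment, the model IDs are placeholders for whichever small, frontier, and reasoning models you actually compare, and the extended-thinking parameters are one way to turn reasoning on, not the only one.

```python
# Minimal harness for the exercise. Model IDs are placeholders; substitute the
# small / frontier / reasoning models you are comparing.
import time
import anthropic

client = anthropic.Anthropic()
PROMPT = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
          "than the ball. How much does the ball cost?")

RUNS = [
    ("small-fast-model-id",  {"max_tokens": 1_024}),   # e.g. a Haiku-class model
    ("frontier-model-id",    {"max_tokens": 1_024}),   # e.g. a Sonnet-class model
    ("frontier-model-id",    {"max_tokens": 16_000,    # same model, extended thinking on
                              "thinking": {"type": "enabled", "budget_tokens": 8_000}}),
]

for model_id, kwargs in RUNS:
    start = time.time()
    resp = client.messages.create(
        model=model_id,
        messages=[{"role": "user", "content": PROMPT}],
        **kwargs,
    )
    latency = time.time() - start
    answer = "".join(b.text for b in resp.content if b.type == "text")
    print(f"{model_id:>22}  {latency:5.1f}s  "
          f"in={resp.usage.input_tokens} out={resp.usage.output_tokens}")
    print(f"  answer: {answer[:120]}")
    # cost = token counts above x your provider's current per-token prices
```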

Field reports — real-world case studies

Hands-on companions

After the theory, three concrete next stops:

Watch it interactively:

  • Scaling Laws — drag log10(N) and log10(D); watch loss, FLOPs, $, tokens/param all retween live with the real Chinchilla constants.
  • Reasoning Models — discrete reasoning_effort slider (none/low/medium/high) showing real-feeling reasoning traces, token counts, latency, and the predicted answer flip on the bat-and-ball problem.
  • MoE Routing — real router math: word → 16-d hashed embedding → seeded W_router (16×8) → softmax → top-K. The router-math panel shows the live computation; a standalone sketch follows this list.
  • Long Context — attention mask topology under sliding-window, strided, global-local patterns. The cost math is on the page.
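
The standalone sketch referenced above, reproducing the router pipeline the demo displays; the hashing scheme and random seeds here are illustrative stand-ins, not necessarily the demo's exact construction.

```python
# The demo's router math, standalone: word -> 16-d hashed embedding ->
# W_router (16x8) -> softmax -> top-k. Hash and seeds are illustrative.
import zlib
import numpy as np

D_MODEL, N_EXPERTS, TOP_K = 16, 8, 2

def hashed_embedding(word):
    """Stable pseudo-embedding: a CRC32 of the word seeds the RNG."""
    rng = np.random.default_rng(zlib.crc32(word.encode()))
    return rng.standard_normal(D_MODEL)

W_router = np.random.default_rng(0).standard_normal((D_MODEL, N_EXPERTS))  # seeded 16x8

def route(word):
    x = hashed_embedding(word)                 # (16,)
    logits = x @ W_router                      # (8,) one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    top = np.argsort(probs)[::-1][:TOP_K]      # indices of the top-k experts
    return top, probs[top]

experts, gates = route("transformer")
print("experts:", experts.tolist(), "gate weights:", np.round(gates, 3).tolist())
```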

Ship the stack:

See also

Further reading

Books move slower than papers in this field — treat these as foundations, not replacements for the latest research. Real authors, real publishers, real editions. Free badges mark books with author-authorized full text online.

  1. Hands-On Large Language Models

    Jay Alammar, Maarten Grootendorst

    O'Reilly, 2024

    Visual and practical, bringing Alammar's classic Illustrated Transformer diagrams into book form.

  2. Build a Large Language Model From Scratch

    Sebastian Raschka

    Manning, 2024

    Implements GPT-2 end-to-end in PyTorch, layer by layer.

  3. Natural Language Processing with Transformers

    Lewis Tunstall, Leandro von Werra, Thomas Wolf

    O'Reilly, Revised ed., 2023

    The HuggingFace book. Architecture-first treatment with working code.