Frontier Architectures

What does a 2026 frontier LLM actually look like? Plenty of innovation has accumulated since the 2017 transformer. Here’s the rough picture, noting that the field is moving fast and specifics will date.

The default LLaMA-class recipe

Most modern open-weights models share a common architecture:

  • Decoder-only transformer
  • RMSNorm (pre-norm)
  • SwiGLU activation in the FFN
  • RoPE positional encoding
  • GQA (grouped-query attention) — typically 4–8 KV groups per 32+ Q heads
  • No biases in linear layers
  • Weight tying between input embedding and LM head
  • 8k–128k context (sometimes more), often extended past the pretraining length via YaRN or similar RoPE scaling

Examples: LLaMA-2/3, Mistral, Qwen 2/3, DeepSeek-V2/V3, Phi-3, Yi, Gemma.

Training a model from scratch with this recipe puts you in the ballpark of any open frontier model; a minimal sketch of the core block follows.
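The recipe above maps fairly directly onto code. Below is a minimal, illustrative PyTorch sketch of one such block (pre-norm RMSNorm, SwiGLU FFN, RoPE, grouped-query attention, no biases). The dimensions, head counts, and 4x FFN expansion are assumptions for illustration rather than any particular model's values, and the full model stack, KV cache, and YaRN-style extension are omitted.

```python
# Minimal sketch of one LLaMA-class transformer block. Illustrative sizes only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by root-mean-square only (no mean centering, no bias).
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SiLU-gated FFN, the standard LLaMA-style feed-forward.
        return self.down(F.silu(self.gate(x)) * self.up(x))


def rope(x, base=10000.0):
    # Rotary position embedding on (batch, heads, seq, head_dim), half-split style.
    b, h, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(t, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class LlamaStyleBlock(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden=4 * dim)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)  # pre-norm
        q = self.q_proj(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # GQA: each KV head is shared by (n_heads / n_kv_heads) query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.o_proj(attn.transpose(1, 2).reshape(b, t, -1))
        return x + self.ffn(self.ffn_norm(x))


x = torch.randn(2, 16, 512)
print(LlamaStyleBlock()(x).shape)  # torch.Size([2, 16, 512])
```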

What frontier closed models likely do

Most details are speculation, but papers, leaks, and observed model behavior offer credible signals:

  • MoE for at least some frontier models (suggested by output behavior and by apparent expert leakage in early GPT-4 outputs).
  • Long context via hybrid attention + KV optimization.
  • Multi-stage training:
    1. Pretrain on a curated mix of text + code + math + multilingual + multimodal data.
    2. Mid-training annealing on high-quality data.
    3. SFT on instruction/chat.
    4. RLHF / RLAIF / DPO / Constitutional AI.
    5. Reasoning RL (for o-series, R1-style models).
    6. Tool use RL.
  • Caching infrastructure baked into serving.

Architecture variants beyond pure transformers

State-space models (SSMs) / Mamba

Recurrent-style architectures whose training can still be parallelized; compute scales linearly in sequence length, versus quadratically for full attention.

  • Mamba / Mamba-2 (Gu & Dao, 2023–2024): selective SSMs with attention-competitive quality.
  • Jamba (AI21, 2024): hybrid transformer + Mamba blocks. Long context with reasonable cost.

SSMs may dominate very long context regimes; for shorter contexts, transformers still win.
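To see where the linear cost comes from, here is a toy diagonal state-space recurrence: one constant-size state update per token, so the whole sequence is O(L). This is a deliberately simplified sketch; real SSMs (S4, Mamba) add careful discretization and replace the Python loop with a parallel scan or convolution during training, and Mamba additionally makes the parameters input-dependent ("selective"). None of that is modeled here.

```python
# Toy diagonal SSM recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
import torch

def ssm_scan(x, a, b, c):
    """x: (seq_len, dim) inputs; a, b, c: (dim,) per-channel parameters."""
    h = torch.zeros_like(x[0])        # hidden state carried across time
    ys = []
    for x_t in x:                     # one O(dim) update per token -> O(L) total
        h = a * h + b * x_t           # state update
        ys.append(c * h)              # readout
    return torch.stack(ys)

x = torch.randn(1024, 16)
a = torch.full((16,), 0.9)            # decay near 1 keeps long-range memory
b, c = torch.ones(16), torch.ones(16)
print(ssm_scan(x, a, b, c).shape)     # torch.Size([1024, 16])
```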

Hybrid attention/SSM

  • Jamba: mostly Mamba, occasional attention.
  • Zamba: Mamba blocks + occasional shared attention layer.

The pattern: cheap recurrent layers do the bulk of the work; occasional attention layers handle precise retrieval.
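A layer schedule for such a stack can be as simple as the sketch below. The 1-in-8 attention ratio is an illustrative assumption (Jamba-like), not a quote of any model's exact configuration.

```python
# Illustrative hybrid layer schedule: mostly SSM layers, occasional attention.
def hybrid_schedule(n_layers: int, attn_every: int = 8) -> list[str]:
    return ["attention" if (i + 1) % attn_every == 0 else "ssm"
            for i in range(n_layers)]

print(hybrid_schedule(8))
# ['ssm', 'ssm', 'ssm', 'ssm', 'ssm', 'ssm', 'ssm', 'attention']
```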

RWKV

Receptance Weighted Key Value — RNN-style inference with parallel training. v6 / v7 are competitive at 7B–14B scale.

Diffusion language models

Generative models that denoise from noise to text. Slower per-token but parallel; some claim quality benefits. Still research-grade for general LLMs as of early 2026.

Multimodal frontier

By 2026, most frontier models are natively multimodal:

  • Text + image input (Claude, GPT-4o/4.1/4.5, Gemini, Qwen-VL, Llama 4)
  • Text + audio I/O (GPT-4o realtime voice, Gemini Live, Claude voice)
  • Text + image + audio + video (frontier closed models)
  • Image generation (some via attached diffusion models, some natively)

Architecture choices:

  • Encoder + projector + LLM (LLaVA-style; used by most open models; see the sketch after this list).
  • Native multimodal token streams (frontier closed models use unified token spaces).
  • Mixture-of-experts per modality (some models).
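A minimal sketch of the first (LLaVA-style) option: a vision encoder produces patch embeddings, a small MLP projector maps them into the LLM's embedding space, and the projected image tokens are concatenated with the text embeddings. The dimensions and patch count below are illustrative assumptions; real systems use a pretrained vision tower (e.g. a ViT/CLIP encoder) and a full LLM in place of the random stand-ins here.

```python
# LLaVA-style wiring: vision features -> MLP projector -> prepended to text tokens.
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                  # assumed sizes

projector = nn.Sequential(                        # typically the first part trained
    nn.Linear(vision_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)

image_patches = torch.randn(1, 576, vision_dim)   # e.g. 24x24 patch features
text_embeds = torch.randn(1, 32, llm_dim)         # embedded prompt tokens

image_tokens = projector(image_patches)           # (1, 576, llm_dim)
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_input.shape)                            # torch.Size([1, 608, 4096])
```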

We dive into multimodal in Stage 12.

Training data composition

Frontier model pretraining mixes (rough estimates from disclosures):

Source                    Share
Web (filtered)            50–70%
Code                      5–20%
Books                     5–10%
Academic / arXiv          1–5%
Math                      1–5%
Multilingual              5–20%
Synthetic / distilled     5–30% (and growing)

The shift toward synthetic data is a major 2024–2026 trend. Phi-4 (Microsoft) was trained heavily on synthetic data; recent Qwen, DeepSeek, and Llama models use significant synthetic data for math, code, and reasoning.
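In practice a mix like the table above becomes a set of sampling weights over data sources. The sketch below draws sources in proportion to assumed weights that fall within the ranges in the table; they are illustrative, not any lab's actual recipe.

```python
# Weighted sampling over an assumed pretraining mix (weights sum to 1.0).
import random

mix = {
    "web": 0.55, "code": 0.12, "books": 0.07, "academic": 0.03,
    "math": 0.03, "multilingual": 0.10, "synthetic": 0.10,
}

def sample_source(rng: random.Random) -> str:
    # Draw a data source in proportion to its mixture weight.
    return rng.choices(list(mix), weights=list(mix.values()), k=1)[0]

rng = random.Random(0)
counts = {k: 0 for k in mix}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)   # roughly proportional to the weights
```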

Post-training stack

After pretraining:

  1. Supervised Fine-Tuning (SFT) on curated instruction-response pairs.
  2. Preference optimization: DPO, IPO, ORPO, or RLHF/PPO for alignment (a DPO loss sketch follows below).
  3. Reasoning RL (for reasoning models): RL on math/code/logic with verifiable rewards.
  4. Constitutional AI (Anthropic) or similar: model critiques and revises its own outputs.
  5. Targeted SFT for specific behaviors: tool use, refusal calibration, persona, format following.

The choice of post-training stack often matters more than architectural choices for final product quality.
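As a concrete example of step 2, here is a minimal sketch of the DPO objective on one preference pair. The inputs are summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; beta is the usual KL-strength hyperparameter, and the numbers are illustrative only.

```python
# Minimal DPO loss on a single preference pair (illustrative values).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Margin: how much more the policy (vs. the reference) prefers
    # the chosen response over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy numbers: the policy already prefers the chosen response slightly more
# than the reference does, so the loss is a bit below log(2) ~= 0.693.
loss = dpo_loss(
    logp_chosen=torch.tensor([-41.0]), logp_rejected=torch.tensor([-58.0]),
    ref_logp_chosen=torch.tensor([-45.0]), ref_logp_rejected=torch.tensor([-57.0]),
)
print(float(loss))   # ~0.474
```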

Frontier model sizes (early 2026)

Class                  Param count (active / total if MoE)
Phone / on-device      1–3B
Edge / fast cloud      7–13B
Mid frontier           70–200B (some MoE)
Frontier closed        200B–1T+ active equivalent

Models continue to scale, but distillation and post-training have narrowed the gap: a well-trained 70B can match a poorly-tuned 200B.

Compute and energy

Frontier training runs in 2025–2026:

  • 10²⁵–10²⁶ FLOPs
  • Tens of thousands of H100/H200/B200 GPUs
  • Months of wall-clock training time
  • $50M–$1B+ in compute costs

This is compute that only a handful of organizations can muster. The downstream open-weights ecosystem benefits via distillation and fine-tuning.
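As a back-of-envelope check on the figures above, the sketch below uses the standard ~6 x parameters x tokens FLOPs approximation for dense transformer training. The model size, token count, per-GPU throughput, and utilization are illustrative assumptions, not any specific run.

```python
# Back-of-envelope training cost, using the ~6 * params * tokens approximation.
params = 400e9            # 400B dense-equivalent parameters (assumed)
tokens = 15e12            # 15T training tokens (assumed)
train_flops = 6 * params * tokens
print(f"{train_flops:.1e} FLOPs")          # ~3.6e+25 FLOPs

gpus = 20_000             # H100-class accelerators (assumed)
peak_per_gpu = 1e15       # ~1 PFLOP/s dense BF16-ish peak, order of magnitude
mfu = 0.4                 # assumed model FLOPs utilization
seconds = train_flops / (gpus * peak_per_gpu * mfu)
print(f"{seconds / 86_400:.0f} days")      # ~52 days of wall-clock time
```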

Where things are heading (informed speculation)

  1. Reasoning everywhere: every frontier model will have a reasoning mode within ~12 months (mid-2026).
  2. Native long context > 1M tokens as default.
  3. Better tool use: web browsing, code execution, multi-tool composition baked into training.
  4. Smaller models reach further: 7–14B models matching today’s 70B on most tasks.
  5. Multimodal as default: text-only models will be the niche.
  6. Continual learning: experimental but growing — models that update on user interactions safely.

Reading list

  • “Attention Is All You Need” (2017) — Vaswani et al.
  • “Scaling Laws for Neural Language Models” (2020) — Kaplan et al.
  • “Training Compute-Optimal Large Language Models” (2022) — Hoffmann et al. (Chinchilla)
  • “LLaMA: Open and Efficient Foundation Language Models” (2023) — Touvron et al.
  • “RoFormer: Enhanced Transformer with Rotary Position Embedding” (2021) — Su et al.
  • “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” (2023) — Gu & Dao
  • “DeepSeek-V3 Technical Report” (2024) — DeepSeek-AI
  • “Llama 3 Technical Report” (2024) — Meta AI

See also