Mixture of Experts (MoE)

A way to scale parameter count without scaling per-token compute. Each token is routed to a small number of “experts” — only those experts activate. The model is huge in total parameters, much smaller in active parameters per token.

The basic idea

Replace the dense FFN of a transformer block with E parallel FFNs (the experts) plus a small router:

Standard FFN block:
    h = MLP(x)

MoE FFN block:
    weights = softmax(router(x))                    # over E experts
    top_k_idxs = topk(weights, k=2)
    h = Σ_(i ∈ top_k_idxs) weights[i] · MLP_i(x)

For each token, only the top-k experts (typically k = 1 or k = 2) compute. The rest sit idle.

Why it’s a big deal

  • Parameters: roughly E × the dense FFN (attention, embeddings, and norms are shared, so the whole model is somewhat under E × a dense equivalent).
  • Active parameters per token: roughly k/E of the total, typically 5–25%.
  • Per-token compute: roughly proportional to active params, not total.

So a 100B-parameter MoE with E=8, k=2 activates ~25B params per token: roughly the per-token inference cost of a 25B dense model, while storing far more knowledge.
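
Back-of-envelope check of that arithmetic (a sketch; the k/E fraction is approximate because the shared non-expert weights run for every token):

total_params = 100e9                  # hypothetical 100B-total MoE
E, k = 8, 2                           # 8 experts, top-2 routing
active_params = total_params * k / E
print(f"~{active_params / 1e9:.0f}B active params per token")   # ~25B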

Real examples (as of 2025):

  • Mixtral 8×22B: 141B total, ~39B active per token.
  • DeepSeek-V3: 671B total, 37B active per token.
  • Grok-1: 314B total, ~25% active.

Routing

The router is a small linear layer that produces logits per expert. Softmax → distribution over experts → top-k selection.

Variants:

  • Top-1 routing (Switch Transformer): cheapest, easier to train.
  • Top-2 routing: smoother gradients, slightly more compute.
  • Expert choice (Zhou et al. 2022): experts pick tokens rather than tokens picking experts, which is naturally load-balanced (sketched after this list).
  • Sparse mixer: deterministic structured routing.
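
A minimal sketch of expert-choice routing, assuming the affinity scores come from a simple linear router and each expert has a fixed token capacity (the function name and signature here are illustrative, not from the paper):

import torch.nn.functional as F

def expert_choice_route(affinity_logits, capacity):
    # affinity_logits: (N, E) raw token-to-expert scores for N tokens, E experts.
    scores = F.softmax(affinity_logits, dim=-1)        # (N, E)
    # Each expert picks its top-`capacity` tokens: top-k down the columns
    # instead of across the rows.
    gate, token_idx = scores.topk(capacity, dim=0)     # both (capacity, E)
    # Expert e processes tokens token_idx[:, e], each weighted by gate[:, e].
    return gate, token_idx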

The load-balancing problem

Naive routing collapses: a few popular experts get everything, others starve. To prevent this, training adds auxiliary losses (the first two are sketched below the list):

  • Load-balancing loss: penalize uneven expert utilization.
  • Z-loss: penalize large router logits so the softmax stays numerically stable.
  • Capacity factor: cap how many tokens any one expert can take.
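
A minimal sketch of the first two losses, assuming top-1 dispatch for the token counts (the function name and signature are illustrative):

import torch
import torch.nn.functional as F

def router_aux_losses(logits, top1_idx, num_experts):
    # logits: (N, E) router logits; top1_idx: (N,) chosen expert per token (long).
    probs = F.softmax(logits, dim=-1)
    # Load-balancing loss (Switch-style): fraction of tokens sent to each expert
    # times that expert's mean router probability, summed and scaled by E.
    frac_tokens = F.one_hot(top1_idx, num_experts).float().mean(dim=0)   # (E,)
    frac_probs = probs.mean(dim=0)                                       # (E,)
    balance_loss = num_experts * (frac_tokens * frac_probs).sum()
    # Router z-loss (ST-MoE): squared log-sum-exp of the logits, averaged.
    z_loss = (torch.logsumexp(logits, dim=-1) ** 2).mean()
    return balance_loss, z_loss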

DeepSeek and others have moved toward auxiliary-loss-free routing — using bias adjustments or expert-choice variants instead.
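
A rough sketch of the bias-adjustment idea, in the spirit of DeepSeek-V3's auxiliary-loss-free balancing but simplified; the helper names and the sign-based update rule are illustrative assumptions, not the published recipe:

import torch

def biased_topk_route(scores, expert_bias, k):
    # scores: (N, E) token-expert affinities; expert_bias: (E,) per-expert bias.
    # The bias steers *which* experts get selected, but not the gate weights.
    _, idxs = (scores + expert_bias).topk(k, dim=-1)   # selection uses biased scores
    gates = torch.gather(scores, -1, idxs)             # gating uses the raw scores
    return gates, idxs

def update_bias(expert_bias, tokens_per_expert, gamma=1e-3):
    # After each step: push bias down for overloaded experts, up for underloaded.
    mean_load = tokens_per_expert.float().mean()
    expert_bias -= gamma * torch.sign(tokens_per_expert.float() - mean_load)
    return expert_bias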

Why MoE works

Two intuitions:

  1. Specialization: different experts learn different concepts (one for code, one for medical text, etc.). The router picks the right specialist.
  2. Parameter efficiency: a single dense FFN must encode everything; MoE distributes the burden.

In practice, expert specialization is fuzzier than the intuition suggests — experts often learn semi-arbitrary partitions. But the empirical scaling holds: a well-trained MoE matches a dense model with ~3–5× more active parameters.

Inference challenges

MoE shines when training compute is the constraint and memory is plentiful, but it creates inference headaches:

  1. Memory: all experts must stay resident in memory, since any token may route to any expert.
  2. Latency: routing adds overhead; tokens in a batch may go to different experts on different GPUs.
  3. Distributed serving: sharding experts across GPUs requires custom kernels and careful batching.
  4. Cold experts: rarely-used experts waste memory.

Modern serving frameworks (vLLM, SGLang, TensorRT-LLM) have sophisticated MoE support; do-it-yourself MoE serving is an entire engineering project.

Sparse MoE in detail

For a transformer block:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    # A single expert: the standard two-layer transformer FFN (activation choice is arbitrary here).
    def __init__(self, d_model, mlp_dim):
        super().__init__()
        self.fc1 = nn.Linear(d_model, mlp_dim)
        self.fc2 = nn.Linear(mlp_dim, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MoEFFN(nn.Module):
    def __init__(self, d_model, mlp_dim, num_experts, k=2):
        super().__init__()
        self.experts = nn.ModuleList([MLP(d_model, mlp_dim) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (B, T, d_model). Flatten tokens for routing.
        B, T, C = x.shape
        x_flat = x.view(B*T, C)

        logits = self.router(x_flat)              # (B*T, E)
        weights, idxs = logits.topk(self.k, dim=-1)   # top-k logits and expert ids, (B*T, k)
        weights = F.softmax(weights, dim=-1)          # renormalize over just the selected k

        # Dispatch each token to its top-k experts (simplified, not efficient)
        out = torch.zeros_like(x_flat)
        for k_i in range(self.k):
            for e_idx in range(len(self.experts)):
                mask = idxs[:, k_i] == e_idx
                if mask.any():
                    out[mask] += weights[mask, k_i:k_i+1] * self.experts[e_idx](x_flat[mask])
        return out.view(B, T, C)
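
Quick usage check (sizes are arbitrary):

moe = MoEFFN(d_model=512, mlp_dim=2048, num_experts=8, k=2)
x = torch.randn(4, 128, 512)      # (batch, seq, d_model)
y = moe(x)                        # (4, 128, 512); each token used only 2 of the 8 experts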

In real implementations, dispatching is done with custom CUDA kernels for speed.

Shared experts

A modern wrinkle (DeepSeek-MoE, Qwen-MoE): every token also goes through a small set of shared experts (always active), in addition to the routed sparse experts. This guarantees baseline competence even if routing fails.
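
A minimal sketch of that structure, reusing the MLP and MoEFFN classes defined above (the layout is illustrative; real models fuse the two paths into a single layer and often shrink the shared experts):

class MoEWithSharedExperts(nn.Module):
    def __init__(self, d_model, mlp_dim, num_experts, k=2, num_shared=1):
        super().__init__()
        self.routed = MoEFFN(d_model, mlp_dim, num_experts, k=k)
        self.shared = nn.ModuleList([MLP(d_model, mlp_dim) for _ in range(num_shared)])

    def forward(self, x):
        out = self.routed(x)            # sparse path: top-k routed experts
        for expert in self.shared:
            out = out + expert(x)       # dense path: always-active shared expert(s)
        return out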

When MoE wins

  • High-quality, big models where active compute is the constraint.
  • Open-weights models where total params signal capacity but inference cost matters.
  • Domain-mixed deployments where expert specialization aligns with use cases.

When MoE loses

  • Small models (under ~10B total params) — overhead exceeds benefit.
  • Low-batch inference where routing latency dominates.
  • Memory-constrained edge — total params must fit in RAM.

Decision flow

Need maximum quality at fixed inference compute?
    → MoE (if you can serve it)

Need predictable latency, edge deployment, or simple infra?
    → Dense

Researcher comparing methods?
    → Pick what your hypothesis tests; report both

See also