Mixture of Experts (MoE)
A way to scale parameter count without scaling per-token compute. Each token is routed to a small number of “experts” — only those experts activate. The model is huge in total parameters, much smaller in active parameters per token.
The basic idea
Replace the dense FFN of a transformer block with E parallel FFNs (the experts) plus a small router:
Standard FFN block:
h = MLP(x)
MoE FFN block:
weights = softmax(router(x)) # over E experts
top_k_idxs = topk(weights, k=2)
h = Σ_(i ∈ top_k_idxs) weights[i] · MLP_i(x)
For each token, only the top-k experts (typically k = 1 or k = 2) compute. The rest sit idle.
Why it’s a big deal
- Parameters: ~E × the dense equivalent.
- Active parameters per token: roughly k/E of the total — typically 5–25%.
- Per-token compute: roughly proportional to active params, not total.
So a 100B-parameter MoE with k=2, E=8 activates ~25B params per token — costing about as much per inference as a 25B dense model, but holding way more knowledge.
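The arithmetic behind that claim, as a back-of-the-envelope helper (it assumes every parameter lives in an expert, ignoring attention, embeddings, and the router, so real ratios land somewhat higher):

def active_params(total_params, num_experts, k):
    # Rough active-parameter count, assuming all params sit in the experts.
    return total_params * k / num_experts

print(active_params(100e9, num_experts=8, k=2) / 1e9)  # 25.0 (billions)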
Real examples (2025):
- Mixtral 8×22B: 141B total, ~39B active per token.
- DeepSeek-V3: 671B total, 37B active per token.
- Grok-1: 314B total, ~25% active.
Routing
The router is a small linear layer that produces logits per expert. Softmax → distribution over experts → top-k selection.
Variants:
- Top-1 routing (Switch Transformer): cheapest, easier to train.
- Top-2 routing: smoother gradients, slightly more compute.
- Expert choice (Zhou et al. 2022): experts pick tokens rather than tokens picking experts, so load is balanced by construction (see the sketch after this list).
- Sparse mixer: deterministic structured routing.
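A minimal sketch of the expert-choice idea, assuming a plain linear router and a fixed per-expert capacity (the function name, shapes, and capacity argument are illustrative, not taken from the paper's code):

import torch
import torch.nn.functional as F

def expert_choice_route(x, router_weight, capacity):
    # x: (num_tokens, d_model); router_weight: (d_model, num_experts).
    scores = F.softmax(x @ router_weight, dim=-1)    # token-to-expert affinities
    # Each expert (column) picks the `capacity` tokens it scores highest.
    gates, token_idx = scores.topk(capacity, dim=0)  # (capacity, num_experts)
    return token_idx.t(), gates.t()                  # (num_experts, capacity) each

# Toy usage: 16 tokens, 4 experts, each expert takes 4 tokens.
x = torch.randn(16, 32)
w = torch.randn(32, 4)
idx, gates = expert_choice_route(x, w, capacity=4)

Because every expert takes the same number of tokens, no balancing loss is needed, which is exactly why this variant is naturally balanced.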
The load-balancing problem
Naive routing collapses: a few popular experts get everything, others starve. To prevent this, training adds auxiliary losses (sketched in code after this list):
- Load-balancing loss: penalize uneven expert utilization.
- Z-loss: keep router logits stable.
- Capacity factor: cap how many tokens any one expert can take.
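Two of these have simple closed forms. A hedged sketch of the Switch Transformer balance loss and the ST-MoE router z-loss (coefficients and exact reductions vary across implementations):

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_idx, num_experts):
    # Switch-Transformer-style: num_experts * sum_i(fraction_i * mean_prob_i).
    # router_logits: (num_tokens, num_experts); expert_idx: (num_tokens,) top-1 picks.
    probs = F.softmax(router_logits, dim=-1)
    fraction = F.one_hot(expert_idx, num_experts).float().mean(dim=0)  # share of tokens per expert
    mean_prob = probs.mean(dim=0)                                      # avg router prob per expert
    return num_experts * (fraction * mean_prob).sum()

def router_z_loss(router_logits):
    # ST-MoE z-loss: keeps router logits small and numerically stable.
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()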
DeepSeek and others have moved toward auxiliary-loss-free routing — using bias adjustments or expert-choice variants instead.
Why MoE works
Two intuitions:
- Specialization: different experts learn different concepts (one for code, one for medical text, etc.). The router picks the right specialist.
- Parameter efficiency: a single dense FFN must encode everything; MoE distributes the burden.
In practice, expert specialization is fuzzier than the intuition suggests — experts often learn semi-arbitrary partitions. But the empirical scaling holds: a well-trained MoE can match a dense model whose parameter count is roughly 3–5× the MoE's active parameter count.
Inference challenges
MoE shines for training compute and storage flexibility, but creates inference headaches:
- Memory: all experts must stay resident in memory, since any token may route to any expert; only the compute is sparse (see the estimate after this list).
- Latency: routing adds overhead; tokens in a batch may go to different experts on different GPUs.
- Distributed serving: sharding experts across GPUs requires custom kernels and careful batching.
- Cold experts: rarely-used experts waste memory.
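A back-of-the-envelope estimate using the Mixtral 8×22B numbers above (assuming bf16 weights and ignoring the KV cache and activations):

BYTES_PER_PARAM = 2       # bf16
total_params = 141e9      # Mixtral 8x22B, total
active_params = 39e9      # touched per token

print(total_params * BYTES_PER_PARAM / 1e9)   # ~282 GB resident
print(active_params * BYTES_PER_PARAM / 1e9)  # ~78 GB actually used per token

You pay dense-model memory for sparse-model compute.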
Modern serving frameworks (vLLM, SGLang, TensorRT-LLM) have sophisticated MoE support; do-it-yourself MoE serving is an entire engineering project.
Sparse MoE in detail
For a transformer block:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    # One plausible expert FFN; the exact MLP definition is assumed, not specified here.
    def __init__(self, d_model, mlp_dim):
        super().__init__()
        self.fc1 = nn.Linear(d_model, mlp_dim)
        self.fc2 = nn.Linear(mlp_dim, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MoEFFN(nn.Module):
    def __init__(self, d_model, mlp_dim, num_experts, k=2):
        super().__init__()
        self.experts = nn.ModuleList([MLP(d_model, mlp_dim) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.k = k

    def forward(self, x):
        # x: (B, T, d_model). Flatten tokens for routing.
        B, T, C = x.shape
        x_flat = x.view(B * T, C)
        logits = self.router(x_flat)                 # (B*T, E)
        weights, idxs = logits.topk(self.k, dim=-1)  # top-k logits and expert ids per token
        weights = F.softmax(weights, dim=-1)         # renormalize over the k selected experts
        # Dispatch each token to its top-k experts (simplified, not efficient)
        out = torch.zeros_like(x_flat)
        for k_i in range(self.k):
            for e_idx in range(len(self.experts)):
                mask = idxs[:, k_i] == e_idx         # tokens whose k_i-th choice is this expert
                if mask.any():
                    out[mask] += weights[mask, k_i:k_i + 1] * self.experts[e_idx](x_flat[mask])
        return out.view(B, T, C)
In real implementations, dispatching is done with custom CUDA kernels for speed.
Shared experts
A modern wrinkle (DeepSeek-MoE, Qwen-MoE): every token also goes through a small set of shared experts (always active), in addition to the routed sparse experts. This guarantees baseline competence even if routing fails.
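A hedged sketch of how the forward pass changes, reusing the MLP and MoEFFN classes from the code above (the split between shared and routed experts here is illustrative):

class SharedPlusRoutedFFN(nn.Module):
    # Every token passes through the shared experts; routed experts add sparse capacity.
    def __init__(self, d_model, mlp_dim, num_shared, num_routed, k=2):
        super().__init__()
        self.shared = nn.ModuleList([MLP(d_model, mlp_dim) for _ in range(num_shared)])
        self.routed = MoEFFN(d_model, mlp_dim, num_routed, k=k)

    def forward(self, x):
        out = self.routed(x)          # sparse: only top-k routed experts fire
        for expert in self.shared:
            out = out + expert(x)     # dense: shared experts always fire
        return out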
When MoE wins
- High-quality, big models where active compute is the constraint.
- Open-weights models where total params signal capacity but inference cost matters.
- Domain-mixed deployments where expert specialization aligns with use cases.
When MoE loses
- Small models (under ~10B total params) — overhead exceeds benefit.
- Low-batch inference where routing latency dominates.
- Memory-constrained edge — total params must fit in RAM.
Decision flow
Need maximum quality at fixed inference compute?
→ MoE (if you can serve it)
Need predictable latency, edge deployment, or simple infra?
→ Dense
Researcher comparing methods?
→ Pick what your hypothesis tests; report both