demo

Center, rescale, ship

Slide the input distribution. Watch a messy activation get centered, normalized to unit variance, then rescaled by learned γ and β. A big part of why every transformer block trains stably.
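The three strips map one-to-one onto the standard LayerNorm computation. Here is a minimal sketch in NumPy; the function name, the ε value, and the per-vector statistics are assumptions for illustration, not the demo's actual code:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # center: subtract the mean (raw strip -> zero mean)
    mu = x.mean()
    # normalize: divide by the std, with a small eps for numerical safety
    x_hat = (x - mu) / np.sqrt(x.var() + eps)
    # rescale: learned gamma stretches, learned beta shifts
    return gamma * x_hat + beta
```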

Try this — predict before you click

  1. Drag scale from 1 to 5 with γ = 1, β = 0. Predict: the raw strip's std bar grows to ~5, the normalized strip's std stays at exactly 1.0 (that's the whole point), and the rescaled strip matches the normalized one because γ = 1.
  2. Same setup but drag γ to 0. Predict: the rescaled output collapses to a flat line at β. This is the failure mode — γ encodes "how much variance to put back in," and γ = 0 zeroes the layer. Real models initialize γ to 1 and rarely let it drift far from there.
  3. Drag drift to 3 with γ = 1, β = 0. Predict: raw mean is ~3, normalized mean is exactly 0, rescaled mean equals β. LayerNorm doesn't care what the input mean is — it always centers.
  4. Crank β to −2 with γ = 1. Predict: the rescaled distribution shifts down by 2 regardless of input drift. β is a learnable bias that the gradient can move; you've just simulated "this layer wants its outputs centered at −2." All four predictions are replayed numerically in the sketch after this list.
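If you'd rather check the predictions off the page, here is a hedged sketch: the synthetic "activation" and the knob values stand in for the sliders, and the layer_norm helper mirrors the sketch above rather than the demo's internals.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
base = rng.normal(size=2048)          # stand-in for a messy activation

# 1. scale = 5, gamma = 1, beta = 0: raw std grows, normalized std stays ~1
x = 5 * base
print(x.std(), layer_norm(x, 1.0, 0.0).std())      # ~5.0 and ~1.0

# 2. gamma = 0: the whole output collapses to a flat line at beta = 0
print(layer_norm(x, 0.0, 0.0).std())               # 0.0

# 3. drift = 3, gamma = 1, beta = 0: raw mean ~3, normalized mean ~0
x = base + 3
print(x.mean(), layer_norm(x, 1.0, 0.0).mean())    # ~3.0 and ~0.0

# 4. beta = -2: output mean sits at beta regardless of the input drift
print(layer_norm(x, 1.0, -2.0).mean())             # ~-2.0
```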

Anchored to 03-neural-networks/regularization-techniques and 06-transformers/transformer-block.