Activations & Initialization
Two unsexy choices that decide whether a deep network trains at all.
Why activations matter
Without a non-linearity between layers, stacked linear layers collapse:
W₂ (W₁ x) = (W₂ W₁) x = W' x
A 50-layer linear network is a 1-layer linear network in disguise. Activations break the linearity and unlock representational depth.
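A quick sanity check, sketched in PyTorch (shapes are arbitrary): two linear layers with no activation in between are numerically identical to their merged product.

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(64, 32)    # layer 1: 32 -> 64
W2 = torch.randn(16, 64)    # layer 2: 64 -> 16
x = torch.randn(32)

deep = W2 @ (W1 @ x)        # two stacked linear "layers"
collapsed = (W2 @ W1) @ x   # the single merged layer W' = W2 W1
print(torch.allclose(deep, collapsed, atol=1e-4))  # True
```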
Sigmoid
σ(x) = 1 / (1 + e^{-x})
Outputs in (0, 1). Smooth, differentiable. Historical default.
Problems:
- Saturates at extremes — derivative ≈ 0 → vanishing gradients.
- Not zero-centered — biases gradient direction.
- Expensive to compute (exp).
Used today for: binary classifier output (sigmoid + BCE) and gating in LSTMs/GRUs.
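The saturation problem is easy to see numerically. A minimal sketch of the gradient at a few inputs:

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
# dσ/dx = σ(x)(1 − σ(x)): roughly 4.5e-5 at |x| = 10, 0.25 at x = 0
print(x.grad)
```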
Tanh
tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x})
Outputs in (-1, 1). Zero-centered, but still saturates. Found in older RNNs and some niche uses.
ReLU
relu(x) = max(0, x)
The default for ~10 years. Cheap, doesn’t saturate on the positive side, doesn’t suffer vanishing gradients in the active region.
Problem: dying ReLUs. A neuron that always outputs 0 has zero gradient — it’s stuck dead forever.
Variants:
- Leaky ReLU: max(αx, x) for small α. Lets some gradient through on the negative side.
- PReLU: leaky ReLU where α is learned.
- ELU, SELU: smooth alternatives, sometimes train better.
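A quick way to see the difference on the negative side (values chosen arbitrarily): ReLU passes no gradient through negative pre-activations, leaky ReLU passes a little.

```python
import torch
import torch.nn.functional as F

pre_act = torch.tensor([-3.0, -1.0, 0.5, 2.0], requires_grad=True)

F.relu(pre_act).sum().backward()
print(pre_act.grad)   # tensor([0., 0., 1., 1.]): negative units get no gradient

pre_act.grad = None
F.leaky_relu(pre_act, negative_slope=0.01).sum().backward()
print(pre_act.grad)   # tensor([0.0100, 0.0100, 1.0000, 1.0000])
```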
GELU
GELU(x) = x · Φ(x)
Where Φ is the standard normal CDF. Smooth; negative inputs map to small negative outputs rather than a hard zero.
GELU is the default in transformers (BERT, GPT-2, GPT-3). Slightly better than ReLU empirically; slightly more compute.
Swish / SiLU
SiLU(x) = x · σ(x)
Self-gated. Used in LLaMA, PaLM, EfficientNet. Smooth like GELU, similarly performant.
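Both are a single call in PyTorch; a quick side-by-side sketch:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4, 4, 9)
print(F.gelu(x))   # x · Φ(x)
print(F.silu(x))   # x · σ(x), a.k.a. Swish
```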
SwiGLU
A gated variant: (W₁ x) ⊙ SiLU(W₂ x). Used in LLaMA-2/3, modern open-source LLMs. Slightly better quality at the cost of an extra projection.
In practice, the feed-forward block of a modern transformer is:
def ff_block(x):
    # SwiGLU: gate the value path W_1 x with SiLU(W_2 x), then project back with W_o
    return W_o(W_1(x) * SiLU(W_2(x)))
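A minimal runnable sketch of that block; the class name and d_ff value below are illustrative assumptions, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block: W_o applied to (W_1 x) ⊙ SiLU(W_2 x)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w_2 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w_o = nn.Linear(d_ff, d_model, bias=False)   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_o(self.w_1(x) * F.silu(self.w_2(x)))

ffn = SwiGLUFFN(d_model=512, d_ff=1376)    # d_ff often chosen near 8/3 · d_model
print(ffn(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```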
Softmax
softmax(z)_i = e^{z_i} / Σⱼ e^{z_j}
Not really an activation per se — it’s the output normalizer for classification. Outputs sum to 1 (a probability distribution). Used in attention’s similarity normalization too.
Practical detail: subtract the max before exponentiating to avoid overflow. PyTorch does this for you in nn.functional.softmax.
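A sketch of that trick (stable_softmax is a hypothetical helper; in practice just call the library function):

```python
import torch

def stable_softmax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    z = z - z.max(dim=dim, keepdim=True).values  # shift so exp() cannot overflow
    e = z.exp()
    return e / e.sum(dim=dim, keepdim=True)

logits = torch.tensor([1000.0, 1001.0, 1002.0])   # naive exp() would overflow to inf
print(stable_softmax(logits))                     # tensor([0.0900, 0.2447, 0.6652])
print(torch.softmax(logits, dim=-1))              # same result
```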
Choosing an activation
| Layer | Default 2026 |
|---|---|
| Hidden layers, MLP | ReLU (simple) or GELU/SiLU (transformer-style) |
| Final layer, classification | Linear (raw logits) → softmax via loss |
| Final layer, regression | Linear |
| LSTM/GRU gates | Sigmoid + tanh |
| Transformer FFN | GELU, SiLU, or SwiGLU |
If you’re not sure, GELU is a fine default for any modern hidden layer.
Initialization
Imagine starting all weights at zero — every neuron in a layer learns the same thing. Symmetric, useless.
Initialize randomly, but with scale that keeps activation variance stable across layers.
Xavier / Glorot
For tanh/sigmoid:
W ~ Uniform(-√(6/(n_in+n_out)), √(6/(n_in+n_out)))
Or equivalent normal version. Maintains variance through linear layers with tanh.
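In PyTorch this is one call; a sketch for a tanh layer (sizes arbitrary):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
nn.init.xavier_uniform_(layer.weight, gain=nn.init.calculate_gain("tanh"))
nn.init.zeros_(layer.bias)
```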
He / Kaiming
For ReLU:
W ~ Normal(0, √(2/n_in))
Compensates for the “half the input is dead” property of ReLU.
PyTorch defaults are reasonable, but set the init explicitly with torch.nn.init if in doubt:
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)
Specialized init for transformers
Modern transformers use scaled init:
- Embedding: small (e.g. N(0, 0.02))
- Output projection of each block: scaled by 1/√(2N), where N is layer count, to keep the residual stream well-behaved at depth
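A sketch of GPT-2-style scaled init, assuming a hypothetical naming convention in which each block's residual-path projection module ends in out_proj or w_o:

```python
import math
import torch.nn as nn

def init_transformer_(model: nn.Module, n_layers: int, std: float = 0.02):
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if isinstance(module, nn.Linear) and module.bias is not None:
                nn.init.zeros_(module.bias)
            # residual-path output projections get scaled down by 1/sqrt(2N)
            if name.endswith(("out_proj", "w_o")):
                nn.init.normal_(module.weight, mean=0.0,
                                std=std / math.sqrt(2 * n_layers))
```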
Get this wrong and a 100-layer transformer will diverge in the first few thousand steps.
Bias init
Usually zero. Exceptions: the forget-gate bias of LSTMs is often set to 1.0 (so the cell remembers by default), plus a few specialized cases.
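Setting the forget-gate bias in PyTorch is slightly fiddly because the four gate biases are packed into one vector per direction; a sketch (sizes arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128)
h = lstm.hidden_size
with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if "bias" in name:          # bias_ih_l0 and bias_hh_l0, each of length 4*h
            bias.zero_()
            # gate order is [input, forget, cell, output]; the two bias vectors add,
            # so 0.5 each gives an effective forget-gate bias of 1.0
            bias[h:2 * h] = 0.5
```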
Visualizing the issue
Build a 50-layer MLP. Initialize with N(0, 1) (way too big). The activations explode by layer 10. Initialize with N(0, 0.001) (way too small). Activations vanish by layer 10. Use He init. Activations stay healthy.
This is one of those things where you should literally code it up and watch.
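A minimal sketch of that experiment (the probe helper and layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

def probe(init_fn, depth=50, width=512):
    torch.manual_seed(0)
    x = torch.randn(1024, width)
    with torch.no_grad():
        for i in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init_fn(layer.weight)
            x = torch.relu(layer(x))
            if i in (0, 9, 49):
                print(f"layer {i + 1:2d}: activation std = {x.std().item():.3e}")

probe(lambda w: nn.init.normal_(w, std=1.0))                       # explodes
probe(lambda w: nn.init.normal_(w, std=0.001))                     # vanishes
probe(lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"))   # stays ~1
```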
What about LayerNorm and BatchNorm?
These mostly make initialization less critical. By forcing each layer’s activations to have zero mean and unit variance, they paper over many init choices. That’s why modern transformers can use plain N(0, 0.02) everywhere — LayerNorm cleans it up.
Practical advice
- Use a sane default: He init for ReLU-family, Xavier for tanh, framework defaults for the rest.
- Watch activations during the first batch. If they explode or vanish, fix init before chasing other bugs.
- Don’t tune init. It rarely buys more than a fraction of a percent. Get it right enough and move on.
See also
- Backpropagation — why gradient flow needs healthy activations
- Regularization techniques — BatchNorm/LayerNorm details
- Stage 06 — Transformer block — modern activation choices in context