Activations & Initialization

Two unsexy choices that decide whether a deep network trains at all.

Why activations matter

Without a non-linearity between layers, stacked linear layers collapse:

W₂ (W₁ x) = (W₂ W₁) x = W' x

A 50-layer linear network is a 1-layer linear network in disguise. Activations break the linearity and unlock representational depth.
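The collapse is easy to verify numerically. A minimal sketch, assuming numpy: two stacked linear maps give exactly the same output as their single composed matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 4))   # first "layer"
W2 = rng.standard_normal((8, 8))   # second "layer"
x = rng.standard_normal(4)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x)
# ...are exactly one linear layer with W' = W2 @ W1
collapsed = (W2 @ W1) @ x

assert np.allclose(two_layer, collapsed)
```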

Sigmoid

σ(x) = 1 / (1 + e^{-x})

Outputs in (0, 1). Smooth, differentiable. Historical default.

Problems:

  • Saturates at extremes — derivative ≈ 0 → vanishing gradients.
  • Not zero-centered — biases gradient direction.
  • Expensive to compute (exp).

Used today for: binary classifier output (sigmoid + BCE) and gating in LSTMs/GRUs.
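The saturation problem is concrete: σ'(x) = σ(x)(1 − σ(x)) peaks at 0.25 and falls off exponentially. A quick check with the stdlib:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # σ'(x) = σ(x)(1 − σ(x)); maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0))    # 0.25 — the best case
print(sigmoid_grad(10))   # ~4.5e-5 — effectively no gradient
```

Stack a few saturated sigmoids and the chain rule multiplies these tiny derivatives together — the vanishing-gradient problem in miniature.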

Tanh

tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x})

Outputs in (-1, 1). Zero-centered, but still saturates. Found in older RNNs and some niche uses.

ReLU

relu(x) = max(0, x)

The default for ~10 years. Cheap, doesn’t saturate on the positive side, doesn’t suffer vanishing gradients in the active region.

Problem: dying ReLUs. A neuron that always outputs 0 has zero gradient — it’s stuck dead forever.

Variants:

  • Leaky ReLU: max(αx, x) for small α. Lets some gradient through on the negative side.
  • PReLU: leaky ReLU where α is learned.
  • ELU, SELU: smooth alternatives, sometimes train better.
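ReLU and Leaky ReLU are one-liners; a sketch in numpy makes the difference on the negative side explicit:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(αx, x): identity for x > 0, small slope α for x < 0,
    # so a negative pre-activation still passes some gradient
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
print(relu(x))        # negatives clipped to 0
print(leaky_relu(x))  # negatives scaled by 0.01
```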

GELU

GELU(x) = x · Φ(x)

Where Φ is the standard normal CDF. Smooth everywhere, and negative inputs produce small negative outputs rather than being hard-clipped to zero like ReLU.

GELU is the default in transformers (BERT, GPT-2, GPT-3). Slightly better than ReLU empirically; slightly more compute.
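A sketch of GELU using the stdlib, writing Φ via the error function, alongside the tanh approximation that several frameworks use for speed:

```python
import math

def gelu_exact(x):
    # x · Φ(x), with Φ expressed via erf: Φ(x) = 0.5 · (1 + erf(x/√2))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # common tanh approximation to the same curve
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, gelu_exact(x), gelu_tanh(x))  # the two agree closely
```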

Swish / SiLU

SiLU(x) = x · σ(x)

Self-gated. Used in LLaMA, PaLM, EfficientNet. Smooth like GELU, similarly performant.

SwiGLU

A gated variant: (W₁ x) ⊙ SiLU(W₂ x). Used in LLaMA-2/3, modern open-source LLMs. Slightly better quality at the cost of an extra projection.

In practice, the feed-forward block of a modern transformer is:

def ff_block(x):
    # (W₁ x) ⊙ SiLU(W₂ x), then project back down with W_o
    return W_o @ ((W_1 @ x) * silu(W_2 @ x))
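Fleshed out as a runnable numpy sketch (the names W_gate, W_up, W_o and the toy sizes are illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64   # toy sizes; real models are thousands wide

# Three projections instead of the usual two — the extra one is the gate
W_gate = rng.standard_normal((d_ff, d_model)) * 0.02
W_up   = rng.standard_normal((d_ff, d_model)) * 0.02
W_o    = rng.standard_normal((d_model, d_ff)) * 0.02

def silu(x):
    return x / (1.0 + np.exp(-x))   # x · σ(x)

def ff_block(x):
    # (W_up x) ⊙ SiLU(W_gate x), then project back to d_model
    return W_o @ ((W_up @ x) * silu(W_gate @ x))

y = ff_block(rng.standard_normal(d_model))
assert y.shape == (d_model,)
```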

Softmax

softmax(z)_i = e^{z_i} / Σⱼ e^{z_j}

Not really an activation per se — it’s the output normalizer for classification. Outputs sum to 1 (a probability distribution). Used in attention’s similarity normalization too.

Practical detail: subtract the max before exponentiating to avoid overflow. PyTorch does this for you in nn.functional.softmax.
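The max-subtraction trick in a few lines of numpy — shifting every logit by the same constant leaves the softmax unchanged (it cancels in the ratio) but keeps exp in range:

```python
import numpy as np

def softmax(z):
    z = z - z.max()   # shift by the max: the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow to inf
p = softmax(z)
print(p)   # finite, sums to 1
```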

Choosing an activation

Layer                          Default 2026
Hidden layers, MLP             ReLU (simple) or GELU/SiLU (transformer-style)
Final layer, classification    Linear (raw logits) → softmax via loss
Final layer, regression        Linear
LSTM/GRU gates                 Sigmoid + tanh
Transformer FFN                GELU, SiLU, or SwiGLU

If you’re not sure, GELU is a fine default for any modern hidden layer.

Initialization

Imagine starting all weights at zero — every neuron in a layer learns the same thing. Symmetric, useless.

Initialize randomly, but with scale that keeps activation variance stable across layers.

Xavier / Glorot

For tanh/sigmoid:

W ~ Uniform(-√(6/(n_in+n_out)), √(6/(n_in+n_out)))

Or equivalent normal version. Maintains variance through linear layers with tanh.

He / Kaiming

For ReLU:

W ~ Normal(0, σ²) with σ = √(2/n_in)

Compensates for the “half the input is dead” property of ReLU.
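Where the factor 2 comes from, checked empirically: for unit-variance pre-activations, ReLU zeroes half the units and halves the second moment, and doubling the weight variance restores it.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)   # pre-activations, variance 1
h = np.maximum(0.0, z)               # ReLU output

print((h == 0).mean())   # ≈ 0.5 — half the units output nothing
print((h**2).mean())     # ≈ 0.5 — second moment halved by ReLU,
                         # which the factor 2 in He init restores
```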

PyTorch defaults are reasonable but check with torch.nn.init if in doubt:

nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
nn.init.zeros_(layer.bias)

Specialized init for transformers

Modern transformers use scaled init:

  • Embedding: small (e.g. N(0, 0.02))
  • Output projection of each block: scaled by 1/√(2N) where N is layer count, to keep the residual stream well-behaved at depth

Get this wrong and a 100-layer transformer will diverge in the first few thousand steps.

Bias init

Usually zero. Exceptions: forget gate bias of LSTMs to 1.0 (helps memory), some specialized cases.

Visualizing the issue

Build a 50-layer MLP. Initialize with N(0, 1) (way too big). The activations explode by layer 10. Initialize with N(0, 0.001) (way too small). Activations vanish by layer 10. Use He init. Activations stay healthy.

This is one of those things where you should literally code it up and watch.
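A compact version of that experiment, assuming numpy (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50
x0 = rng.standard_normal((128, n))

def final_std(scale):
    """Forward a 50-layer ReLU MLP with weights ~ N(0, scale²);
    return the activation std at the last layer."""
    x = x0
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale
        x = np.maximum(0.0, x @ W)
    return x.std()

print(final_std(1.0))               # way too big: activations explode
print(final_std(0.001))             # way too small: activations vanish
print(final_std(np.sqrt(2.0 / n)))  # He init: stays O(1)
```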

What about LayerNorm and BatchNorm?

These mostly make initialization less critical. By forcing each layer’s activations to have zero mean and unit variance, they paper over many init choices. That’s why modern transformers can use plain N(0, 0.02) everywhere — LayerNorm cleans it up.

Practical advice

  1. Use a sane default: He init for ReLU-family, Xavier for tanh, framework defaults for the rest.
  2. Watch activations during the first batch. If they explode or vanish, fix init before chasing other bugs.
  3. Don’t tune init. It rarely buys more than a fraction of a percent. Get it right enough and move on.

See also