demo
Six activations, side by side
The non-linearity is what makes a neural network neural. Compare ReLU, Leaky ReLU, GELU, SiLU, tanh, and sigmoid on one plot, along with their derivatives, which are what backprop actually multiplies through.
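A minimal sketch of such a side-by-side plot, assuming NumPy and Matplotlib are available; the GELU here uses the common tanh approximation rather than the exact Gaussian-CDF form, and the derivatives are taken numerically with np.gradient instead of in closed form:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 1001)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

activations = {
    "ReLU": np.maximum(0.0, x),
    "Leaky ReLU": np.where(x > 0, x, 0.01 * x),
    # tanh approximation of GELU (the exact form uses the Gaussian CDF)
    "GELU": 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3))),
    "SiLU": x * sigmoid(x),
    "tanh": np.tanh(x),
    "sigmoid": sigmoid(x),
}

fig, (ax_f, ax_d) = plt.subplots(1, 2, figsize=(10, 4))
for name, y in activations.items():
    ax_f.plot(x, y, label=name)
    ax_d.plot(x, np.gradient(y, x), label=name)  # numerical derivative is fine for plotting

ax_f.set_title("f(x)")
ax_d.set_title("f'(x)  (what backprop multiplies through)")
ax_d.legend(loc="upper left", fontsize=8)
plt.tight_layout()
plt.show()
```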
Why this matters
Without a non-linear activation, stacking linear layers just gives you another linear layer — no matter how many. The activation function is what turns a stack of matrix multiplies into a universal function approximator.
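A quick numerical illustration of that collapse, as a toy NumPy sketch (biases omitted for brevity; they collapse the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer"
W2 = rng.normal(size=(2, 4))   # second "layer"
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # stack of two linear layers, no activation in between
one_layer = (W2 @ W1) @ x      # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True: the stack is just one matrix
```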
The choice of activation also determines whether gradients can flow back through the network. Sigmoid and tanh saturate: their derivatives are ~0 in the tails, so deep networks built on them stop learning. ReLU's derivative, a constant 1 for positive inputs and exactly 0 for negative ones, is a large part of what made deep learning practical around 2011. GELU and SiLU are smoother successors that tend to train transformers slightly better.
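A small numeric check of the saturation claim, evaluating the derivatives a modest distance into the tail (x = 6); multiplying factors this small across many layers shrinks the gradient roughly exponentially with depth:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 6.0
print(sigmoid(x) * (1.0 - sigmoid(x)))  # sigmoid'(6) ~ 0.0025
print(1.0 - np.tanh(x) ** 2)            # tanh'(6)    ~ 2.5e-5
print(1.0 if x > 0 else 0.0)            # ReLU'(6)    = 1
```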
Anchored to 03-neural-networks/activations-and-initialization.