demo
Six activations, side by side
The non-linearity is what makes a neural network neural. Compare ReLU, Leaky ReLU, GELU, SiLU, tanh, and sigmoid on one plot, along with their derivatives, which are what backprop actually multiplies through.
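A minimal sketch of such a side-by-side plot, assuming NumPy and Matplotlib are available; the GELU here uses the common tanh approximation rather than the exact Gaussian-CDF form, and the derivatives are taken numerically with np.gradient instead of in closed form:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 1001)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

activations = {
    "ReLU": np.maximum(0.0, x),
    "Leaky ReLU": np.where(x > 0, x, 0.01 * x),
    # tanh approximation of GELU (the exact form uses the Gaussian CDF)
    "GELU": 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3))),
    "SiLU": x * sigmoid(x),
    "tanh": np.tanh(x),
    "sigmoid": sigmoid(x),
}

fig, (ax_f, ax_d) = plt.subplots(1, 2, figsize=(10, 4))
for name, y in activations.items():
    ax_f.plot(x, y, label=name)
    ax_d.plot(x, np.gradient(y, x), label=name)  # numerical derivative is fine for plotting

ax_f.set_title("f(x)")
ax_d.set_title("f'(x)  (what backprop multiplies through)")
ax_d.legend(loc="upper left", fontsize=8)
plt.tight_layout()
plt.show()
```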
Why this matters
Without a non-linear activation, stacking linear layers just gives you another linear layer — no matter how many. The activation function is what turns a stack of matrix multiplies into a universal function approximator.
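A quick numerical illustration of that collapse, as a toy NumPy sketch (biases omitted for brevity; they collapse the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer"
W2 = rng.normal(size=(2, 4))   # second "layer"
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # stack of two linear layers, no activation in between
one_layer = (W2 @ W1) @ x      # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True: the stack is just one matrix
```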
The choice of activation also determines whether gradients can flow back through the network. Sigmoid and tanh saturate: their derivatives are ~0 in the tails, so deep networks built on them stop learning. ReLU's derivative, a constant 1 for positive inputs and exactly 0 for negative ones, is a large part of what made deep learning practical around 2011. GELU and SiLU are smoother successors that tend to train transformers slightly better.
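A small numeric check of the saturation claim, evaluating the derivatives a modest distance into the tail (x = 6); multiplying factors this small across many layers shrinks the gradient roughly exponentially with depth:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 6.0
print(sigmoid(x) * (1.0 - sigmoid(x)))  # sigmoid'(6) ~ 0.0025
print(1.0 - np.tanh(x) ** 2)            # tanh'(6)    ~ 2.5e-5
print(1.0 if x > 0 else 0.0)            # ReLU'(6)    = 1
```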
Anchored to 03-neural-networks/activations-and-initialization.