Regularization & Generalization
Generalization is the only thing that matters. A model that achieves 0% training error and 50% test error is worthless. Regularization is the toolkit for closing that gap.
The bias-variance tradeoff
Decompose expected error into:
E[error] = bias² + variance + irreducible noise
- Bias: how wrong the model is in expectation. Comes from a hypothesis class too simple to capture the truth.
- Variance: how much the model’s predictions wobble across different training sets. Comes from a hypothesis class too flexible — it memorizes noise.
- Irreducible noise: real label noise / aleatoric uncertainty.
Underfitting = high bias. Overfitting = high variance. Most ML knobs trade these against each other.
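Written out for squared error (expectation over training sets; f is the true function, f̂ the learned predictor, σ² the label-noise variance):
E[(y − f̂(x))²] = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ²
The first term is the bias², the second the variance, the third the irreducible noise.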
Modern deep learning partially escapes the classical curve. With enough data and the right architecture, very large models generalize better than smaller ones — the double descent phenomenon. The bias-variance frame still applies; the regime is different.
What overfitting looks like
- Training loss falls; validation loss stops or rises.
- Predictions on held-out examples swing wildly under small perturbations of the training set (the signature of high variance).
- The model memorizes specific training examples (you can detect this with membership inference attacks).
The fix is always one or more of: more data, simpler model, more regularization, better augmentation.
L1 and L2 regularization
Add a penalty on weight magnitudes to the loss.
- L2 (ridge / weight decay): λ Σ wᵢ². Smooth shrinkage. Default for neural networks.
- L1 (lasso): λ Σ |wᵢ|. Drives weights to exact zero — feature selection as a side effect.
- Elastic net: weighted sum of both.
For transformers: weight decay typically 0.01–0.1. AdamW applies it correctly (decoupled from gradient updates).
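A minimal sketch of both forms, assuming a generic model and a precomputed task_loss (names and hyperparameter values are illustrative); note that AdamW applies the decay directly to the weights rather than adding a penalty to the loss:
import torch

# explicit L2 penalty added to the loss (classical ridge); lam is illustrative
lam = 1e-4
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = task_loss + lam * l2_penalty

# decoupled weight decay via AdamW (typical transformer-style settings)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)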
Dropout
Randomly zero a fraction p of activations during training. At inference, dropout is disabled and the full network is used; with the inverted-dropout convention used by modern frameworks, surviving activations are scaled by 1/(1−p) at training time, so no inference-time rescaling is needed.
self.dropout = nn.Dropout(p=0.1)  # declared in the module's __init__
x = self.dropout(x)               # applied in forward()
- Prevents co-adaptation of neurons.
- Acts like ensembling: each mini-batch trains a different sub-network.
Typical values: p=0.1 for transformers; up to 0.5 for older fully-connected nets.
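In PyTorch, whether dropout fires is controlled by the module's mode, so switch it off for evaluation:
model.train()  # dropout active: each activation zeroed with probability p, survivors scaled by 1/(1-p)
model.eval()   # dropout disabled: the layer acts as the identity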
Largely replaced by other techniques in modern LLMs (which often use dropout=0 during pretraining).
Early stopping
Stop training when validation loss plateaus or rises. Cheap, effective. Works as implicit regularization.
# sketch: evaluate(), save_checkpoint(), model, val_loader, max_epochs are placeholders for your own training code
best_val = float("inf")
patience_counter = 0
patience = 5  # epochs to wait after the last improvement
for epoch in range(max_epochs):
    val_loss = evaluate(model, val_loader)
    if val_loss < best_val:
        best_val = val_loss
        patience_counter = 0
        save_checkpoint()
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break
Data augmentation
The cheapest, most effective regularizer.
- Vision: random crops, flips, color jitter, RandAugment, AutoAugment, Mixup, CutMix (Mixup is sketched after this list).
- Text: back-translation, synonym replacement, word dropout, EDA. (Less common; pretraining replaced most needs.)
- Audio: SpecAugment, time/frequency masking.
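To make one of these concrete, a minimal Mixup sketch (the function name and alpha default are illustrative; assumes one-hot label vectors):
import torch

def mixup(x, y_onehot, alpha=0.2):
    # blend random pairs of examples and their labels with a Beta-sampled weight
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed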
“Want a more robust model? Generate more data — for free — by augmenting the training set.”
In modern LLM training, the equivalent is diverse, high-quality, carefully-mixed data. There’s no augmentation per se, but the data mixture is the primary lever.
Batch normalization & layer normalization
Normalize activations during training. Side effects: easier optimization and a mild regularization-like effect (classically attributed to reducing internal covariate shift).
- BatchNorm: normalize across the batch dimension. Default for vision.
- LayerNorm: normalize across features for each example. Default for transformers (because batch sizes vary, e.g. during inference).
- RMSNorm: simpler variant of LayerNorm without mean centering (sketched below). Used in LLaMA-style models.
- GroupNorm: middle ground; useful in vision when batch is small.
Place norms inside the residual branch before each sublayer (pre-norm) or after the residual addition (post-norm). Pre-norm is more stable for very deep transformers.
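A minimal sketch of the RMSNorm variant mentioned above (the eps default is illustrative):
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature scale

    def forward(self, x):
        # scale by the root-mean-square of the features; no mean subtraction
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)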
Label smoothing
Replace one-hot labels [0,0,1,0] with [0.025, 0.025, 0.925, 0.025]. Reduces overconfidence; usually improves calibration and slightly improves accuracy.
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
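The smoothed targets in the example above can also be built by hand, which makes the arithmetic explicit:
import torch

num_classes, eps = 4, 0.1
target = torch.full((num_classes,), eps / num_classes)  # 0.025 everywhere
target[2] += 1.0 - eps                                   # 0.925 on the true class
# target is now [0.025, 0.025, 0.925, 0.025]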
Weight tying
In language models, share the input embedding matrix and the output projection. Halves parameters; usually slightly improves perplexity. Almost universal.
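A minimal sketch of the idea in PyTorch (module and attribute names are illustrative):
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one shared (vocab_size, d_model) matrix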
Gradient clipping
Cap the L2 norm of the gradient at a threshold. Prevents occasional explosions from one bad batch.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
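In a training loop, the clipping call goes between the backward pass and the optimizer step:
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if the grad norm exceeds 1.0
optimizer.step()
optimizer.zero_grad()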
Architectural inductive biases
The strongest “regularizer” of all is choosing an architecture that matches the data:
- Convolutions for images (translation equivariance).
- Attention for sequences (position-agnostic, content-based).
- Graph networks for graphs.
- Equivariant networks for 3D molecular data.
A right-fit architecture trains faster, needs less data, and generalizes better than brute force.
Modern view: scale and data quality
Once you’re training a 7B+ model on hundreds of billions of tokens:
- Most classical regularization knobs (dropout, label smoothing) are turned off or way down.
- The implicit regularization from SGD itself does a lot of work.
- Data quality and diversity become the dominant lever.
- Curriculum and data mixture matter more than tuning λ on weight decay.
Generalization beyond i.i.d.
The classical theory assumes train and test come from the same distribution. In practice:
- Distribution shift happens (covariate shift, label shift, concept drift).
- Out-of-distribution (OOD) generalization is harder than in-distribution.
- Compositional generalization — does the model handle novel combinations of seen primitives? — is a frontier concern.
Foundation models help here: large pretraining data covers more of the input space, so “test” inputs are usually closer to “train” than they would be for a small narrow model.
Pitfalls
- Tuning regularization on the test set. Always use validation.
- Stacking too many regularizers. Each one underfits a little; together they can crush the model. Add one at a time.
- Treating dropout as a fix for bad data. It isn’t. Fix the data.
- Ignoring the bias side. If validation accuracy is bad and training accuracy is also bad, you have a bias problem — more model, more features, better architecture, not more regularization.