Regularization Techniques

Specific methods for keeping deep networks trainable and generalizable.

Dropout

Randomly zero a fraction p of the activations during training:

self.dropout = nn.Dropout(p=0.1)
x = self.dropout(x)

Effects:

  • Forces the network not to rely on any single neuron.
  • Approximates ensembling: each batch trains a different sub-network.
  • At inference, dropout is off. PyTorch uses inverted dropout: surviving activations are scaled by 1/(1 − p) during training, so no rescaling is needed at eval time (handled automatically by nn.Dropout; see the sketch below).
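
A minimal sketch of this behavior (the p = 0.5 value here is illustrative):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: surviving activations are scaled by 1/(1 - p)
print(drop(x))  # roughly half the entries are 0.0, the rest are 2.0

drop.eval()     # eval mode: dropout is a no-op
print(drop(x))  # all ones, unchanged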

Modern usage:

  • Transformers: p = 0.1 historically; many modern LLMs use p = 0 during pretraining.
  • Pre-LLM CNNs: p = 0.5 in fully-connected layers.

Dropout variants:

  • DropConnect: drop weights instead of activations.
  • Stochastic depth: drop entire layers (used in deep ResNets).
  • DropPath: variant of stochastic depth for transformers.

Batch normalization (BatchNorm)

Normalize activations across the batch dimension within each channel:

y = γ · (x − μ_batch) / √(σ²_batch + ε) + β

Where γ, β are learned per-channel.

Pros:

  • Faster training, less sensitive to initialization.
  • Acts as light regularization (the noise from per-batch statistics; the original "internal covariate shift" explanation is debated).
  • Standard in vision (CNNs).

Cons:

  • Behavior differs between training and inference (inference uses running averages).
  • Sensitive to small batch sizes.
  • Distributed training requires SyncBatchNorm.

self.bn = nn.BatchNorm2d(num_features=64)
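
A quick sketch of the train/eval difference (shapes and values are illustrative):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=64)
x = torch.randn(32, 64, 28, 28)   # (batch, channels, height, width)

bn.train()
y = bn(x)   # normalizes with this batch's statistics and updates the running mean/var

bn.eval()
y = bn(x)   # uses the accumulated running mean/var instead of batch statistics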

Layer normalization (LayerNorm)

Normalize across the feature dimension, independently per example:

self.ln = nn.LayerNorm(hidden_size)

Pros:

  • Independent of batch size — works at inference.
  • Standard in transformers (where batch sizes vary and sequences have variable length).
  • No need for running statistics.

Variants:

  • RMSNorm: skip mean centering, only scale by RMS. Slightly faster; used in the LLaMA family (a minimal sketch follows this list).
  • Pre-norm vs post-norm: pre-norm (LN before sublayer) is more stable for very deep transformers.
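
A minimal RMSNorm sketch, assuming the usual formulation (class name and eps default are illustrative; production implementations differ in details such as dtype handling):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # learned per-feature scale
        self.eps = eps

    def forward(self, x):
        # Scale by the root-mean-square over the feature dimension; no mean subtraction
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)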

Weight decay

Classically L2 regularization (a penalty added to the loss); in modern usage it means decoupled weight decay as in AdamW, applied directly to the weights at each step rather than through the loss gradient.

optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)

Typical: 0.01 to 0.1 for transformers. Don’t apply to biases or normalization parameters (a quirk you’ll see in many codebases).
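
A sketch of how that exclusion is often implemented, via optimizer parameter groups (the ndim-based split and the lr value are illustrative, and `model` is assumed to exist):

import torch

decay, no_decay = [], []
for name, param in model.named_parameters():
    # 1-D parameters are biases and normalization scales/shifts: no weight decay
    if param.ndim == 1 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=3e-4,
)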

Gradient clipping

Cap the global gradient norm to prevent rare explosive batches from destabilizing training:

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Almost always set to 1.0 for transformer training. Cheap insurance.
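
A sketch of where the call goes in a training step (the function and its arguments are illustrative, not from the source):

import torch

def training_step(model, batch, optimizer, loss_fn, max_norm=1.0):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    # Clip after backward() and before step(), so the optimizer applies the clipped gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    return loss.item()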

Label smoothing

Soften the target labels: instead of one-hot, mix in a small ε of the uniform distribution, so the target for K classes becomes (1 − ε) · one_hot + ε / K.

loss = F.cross_entropy(logits, targets, label_smoothing=0.1)

  • Reduces overconfidence.
  • Often improves accuracy and especially calibration.
  • Used in vision (ε ≈ 0.1) and some LLM training.

Early stopping

Monitor validation loss; stop when it plateaus.

if val_loss < best_val_loss - tol:
    best_val_loss = val_loss   # improvement: remember it and reset the counter
    patience = 0
else:
    patience += 1              # no improvement this evaluation
    if patience >= max_patience:
        break                  # stop training

Effective and free.

Data augmentation

The strongest regularizer in vision and audio:

  • Vision: random crop, flip, color jitter, RandAugment, AutoAugment, Mixup, CutMix (a typical pipeline is sketched after this list).
  • Audio: SpecAugment (mask time/frequency bands).
  • Tabular: harder; mostly engineered (e.g. SMOTE for class imbalance, Mixup).
  • Text: traditionally back-translation, EDA, word dropout. In the LLM era, mostly replaced by diverse pretraining data.
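
For the vision case, a typical torchvision training pipeline might look like this (the specific transforms and parameter values are illustrative):

from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),          # random crop, rescaled to 224x224
    transforms.RandomHorizontalFlip(),          # random left-right flip
    transforms.ColorJitter(0.4, 0.4, 0.4),      # brightness / contrast / saturation jitter
    transforms.ToTensor(),
])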

Mixup and CutMix

Mixup: linearly combine pairs of inputs and labels.

x' = λ x_i + (1−λ) x_j
y' = λ y_i + (1−λ) y_j
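
A minimal Mixup sketch following the equations above (the alpha default and one-hot labels are illustrative assumptions):

import torch

def mixup(x, y_onehot, alpha=0.2):
    # Mixing coefficient lambda ~ Beta(alpha, alpha)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Pair each example with a randomly chosen partner from the same batch
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed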

CutMix: paste a patch from another image.

Both improve generalization and calibration in vision. Less common in language tasks.

Knowledge distillation

Train a small “student” to match a large “teacher” model’s outputs (logits):

L = α · CE(student_logits, true_labels) + (1 − α) · T² · KL(softmax(teacher_logits / T) ‖ softmax(student_logits / T))

Where T is a temperature softening the teacher’s distribution. Distillation improves the student beyond what training on labels alone achieves.
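
A sketch of that loss (function name and the alpha and T defaults are illustrative):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy against the true labels
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL between temperature-softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1 - alpha) * kl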

In modern AI, distillation is how we get small models to behave like big ones (e.g. distilling from a 70B to a 7B).

Stochastic weight averaging (SWA / EMA)

Maintain a running average of weights during training:

w̄ ← α · w̄ + (1 − α) · w

At evaluation, use w̄ instead of w. Improves generalization for free.

In practice: EMA (Exponential Moving Average) is the modern variant, used in diffusion models routinely and increasingly in LLM training.
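
A minimal EMA sketch, assuming the average is updated after each optimizer step (class name and decay value are illustrative; real implementations also track buffers such as BatchNorm statistics):

import copy
import torch

class EMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()   # holds the averaged weights
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        # w_bar <- decay * w_bar + (1 - decay) * w
        for s, p in zip(self.shadow.parameters(), model.parameters()):
            s.mul_(self.decay).add_(p, alpha=1 - self.decay)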

Spectral normalization

Constrain the largest singular value of each weight matrix. Used in GANs, sometimes in classifiers for adversarial robustness.
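
PyTorch ships a utility for this; a one-line sketch (the layer shown is illustrative):

import torch
import torch.nn as nn

conv = torch.nn.utils.spectral_norm(nn.Conv2d(64, 64, kernel_size=3, padding=1))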

Weight tying

Share parameters across layers when it makes sense:

  • Tie input embedding and output projection in language models.
  • Tie encoder and decoder embeddings in seq2seq.

A regularizer disguised as a parameter-saving trick.
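
A sketch of the first case, tying the input embedding and the output projection (the module is hypothetical, not from the source):

import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # one shared (vocab, hidden) parameter

    def forward(self, token_ids):
        h = self.embed(token_ids)   # (batch, seq, hidden); a real model has layers in between
        return self.lm_head(h)      # logits over the vocabulary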

Modern view in foundation models

For very large LLM pretraining:

  • Dropout off or extremely low.
  • Weight decay 0.1.
  • Gradient clipping 1.0.
  • No label smoothing typically.
  • Strong implicit regularization from data scale + SGD noise.

Most “regularization knobs” are turned off; data quality and diversity carry the weight.

For fine-tuning, dropout sometimes comes back to prevent overfitting on small datasets.

Practical advice

  1. Start with what your reference architecture uses. If you’re training a transformer, copy a transformer recipe.
  2. Don’t stack regularizers blindly. Add one at a time and measure.
  3. Watch for symptoms:
    • Train loss low, val loss high → more reg or more data.
    • Train loss won’t drop → less reg, more model capacity.
    • Training unstable → grad clip, lower LR, longer warmup.
  4. Always normalize. LayerNorm or RMSNorm in transformers, BatchNorm in CNNs. No modern architecture skips this.

See also