Optimizers

The optimizer turns gradients into parameter updates. SGD is the foundation; Adam and its descendants dominate modern practice.

SGD (Stochastic Gradient Descent)

θ ← θ − η · ∇L(θ)

The most basic optimizer. With a tuned learning rate and a good schedule, SGD trains state-of-the-art CNNs.

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

SGD with momentum

v ← β · v + ∇L(θ)
θ ← θ − η · v

Momentum smooths out gradient noise and accelerates convergence in narrow valleys.

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

β = 0.9 is the universal default. Nesterov momentum is a slight variant — sometimes a tiny improvement.
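In PyTorch, Nesterov is just a flag on the same SGD constructor:

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)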

RMSprop

Maintains a running average of squared gradients per parameter. Divides updates by its square root, scaling each parameter’s learning rate adaptively.

v ← β · v + (1 − β) · g²
θ ← θ − η · g / (√v + ε)

A precursor to Adam. Still used in some RL setups.
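PyTorch ships it directly; a minimal usage sketch (alpha is the β of the running average above):

optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)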

Adam

Combines momentum (first moment of gradient) with adaptive scaling (second moment).

m ← β₁ m + (1 − β₁) g
v ← β₂ v + (1 − β₂) g²
m̂ = m / (1 − β₁^t)            # bias correction
v̂ = v / (1 − β₂^t)
θ ← θ − η · m̂ / (√v̂ + ε)

The default optimizer for most deep learning. Defaults: β₁=0.9, β₂=0.999, η=1e-3, ε=1e-8.

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
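To make the bias correction concrete, here is a minimal from-scratch sketch of one Adam step for a single parameter tensor (illustrative only, not how torch.optim implements it; t is the step count, starting at 1):

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # update the biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias correction: both moments start at zero, so early estimates are too small
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # per-parameter step, scaled by the root of the second moment
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v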

AdamW

The default for transformers.

Adam with decoupled weight decay — weight decay is applied directly to θ rather than added to the gradient (which makes the effective decay depend on the second moment, distorting it).
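The difference in one line, as a sketch reusing the bias-corrected moments m̂ and v̂ from the Adam update above (adamw_update is an illustrative name):

def adamw_update(param, m_hat, v_hat, lr, weight_decay, eps=1e-8):
    # same adaptive step as Adam ...
    step = lr * m_hat / (v_hat.sqrt() + eps)
    # ... but the decay shrinks the weights directly, never passing through √v̂
    return param - step - lr * weight_decay * param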

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

For LLM training: lr ~ 1e-4 to 6e-4, β₂ ~ 0.95, weight_decay ~ 0.1.

Lion (2023)

Discovered by symbolic search. Uses only the sign of the momentum:

update = sign(β₁ · m_prev + (1 − β₁) · g)
θ ← θ − η · (update + λ · θ)
m ← β₂ m + (1 − β₂) g

Memory-efficient: stores only the first moment, not the second. Often matches or slightly beats AdamW with ~3× smaller learning rate.
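PyTorch has no built-in Lion; a from-scratch sketch of one step, following the equations above (lion_step and its defaults are illustrative):

import torch

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.1):
    # interpolate momentum and the current gradient, keep only the sign
    update = torch.sign(beta1 * m + (1 - beta1) * grad)
    # decoupled weight decay folded into the same step
    param = param - lr * (update + weight_decay * param)
    # the momentum buffer is refreshed with the second coefficient
    m = beta2 * m + (1 - beta2) * grad
    return param, m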

Sophia, Shampoo, second-order methods

Second-order optimizers approximate the Hessian to take more informed steps. Historically too expensive for large models. Recent work (Sophia, Shampoo, distributed Shampoo) makes them practical at scale and shows ~2× faster convergence on LLM pretraining. Still niche but gaining ground in 2026.

Adafactor

Like Adam but stores second moments in factored form (rank-1 approximation), saving substantial memory. Used for very large model training (T5, PaLM).
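One widely used implementation ships with Hugging Face transformers (assumed installed here); with relative_step=False you supply an explicit learning rate:

from transformers import Adafactor

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)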

Learning rate schedules

The learning rate is rarely constant.

Linear warmup

Start at 0, ramp linearly to peak over the first ~1–10% of training. Critical for transformers — without warmup, attention can produce huge gradients early.
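A minimal warmup-only sketch via a LambdaLR multiplier (warmup_steps is an illustrative choice; call scheduler.step() once per optimizer step):

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # ramp the LR multiplier from ~0 to 1 over warmup_steps, then hold at 1
    lambda step: min(1.0, (step + 1) / warmup_steps),
)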

Cosine decay

After warmup, decay to a small fraction (e.g. 10% of peak) following a cosine curve.

η_t = η_min + 0.5 · (η_max − η_min) · (1 + cos(π · t / T))

Modern LLM training uses linear warmup → cosine decay. Default.
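A sketch of that combination from built-in schedulers, chained with SequentialLR (step counts and eta_min are illustrative, sized for a 100k-step run at peak LR 3e-4):

from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1_000)
cosine = CosineAnnealingLR(optimizer, T_max=99_000, eta_min=3e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1_000])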

Step / multistep

Drop the learning rate by 10× at fixed milestones. Old-school CNN training (ImageNet ResNet recipes used this).
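The classic recipe as a one-liner (milestones in epochs, values illustrative):

scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)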

One-cycle

Warmup to peak, then decay below the starting LR. “Super-convergence” for short training runs.
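PyTorch ships this as OneCycleLR; a sketch (max_lr and total_steps are illustrative):

scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=10_000)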

Per-layer learning rates

Different layers can use different LRs:

  • Discriminative fine-tuning: lower LR for early layers.
  • LoRA: only the adapter weights go into the optimizer; the base weights stay frozen.

PyTorch supports this via parameter groups:

optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-5},
    {"params": model.head.parameters(), "lr": 1e-3},
])

Gradient accumulation

When you can’t fit your desired batch size in memory, accumulate gradients over multiple smaller batches before stepping:

for i, batch in enumerate(loader):
    loss = compute_loss(batch) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Effective batch size = micro-batch × accum_steps × world_size.

Mixed precision

Train in bf16 (preferred) or fp16 with fp32 master weights. ~2× speed and memory savings.

# fp16 path: GradScaler guards against gradient underflow
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast(dtype=torch.float16):
    loss = compute_loss(batch)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

For most modern hardware, bf16 is the default — no GradScaler needed, no underflow issues.
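A bf16 sketch under that assumption: autocast alone, no scaler.

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = compute_loss(batch)
loss.backward()
optimizer.step()
optimizer.zero_grad()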

Picking an optimizer

Scenario                         Pick
Vision (CNNs from scratch)       SGD + momentum, cosine LR
Transformers, LLMs               AdamW, linear warmup + cosine
Memory-constrained LLM           Adafactor, Lion
Quick experiments                AdamW with default LR
RL, where gradients are noisy    Adam, sometimes RMSprop

Hyperparameter ranges (modern defaults)

  • AdamW: lr ~ 1e-4 to 3e-4, betas = (0.9, 0.95), weight_decay = 0.1
  • SGD+momentum: lr ~ 0.01 to 0.1, momentum = 0.9, weight_decay = 1e-4
  • Lion: lr ~ 1e-5 to 1e-4 (much smaller than AdamW)

Practical advice

  1. Tune the learning rate first. Other knobs are second-order.
  2. Use warmup for transformers. No exceptions.
  3. Watch the first 1000 steps. If loss explodes or stalls, the LR is wrong.
  4. AdamW for almost everything. Don’t second-guess unless you have a reason.
  5. Save checkpoints. Optimizer state is part of the checkpoint, not just the weights (see the sketch below).
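A minimal checkpoint sketch for point 5 (path and dictionary keys are illustrative):

# save model and optimizer state together so training can resume exactly
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
    "checkpoint.pt",
)

# restore
ckpt = torch.load("checkpoint.pt")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])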

See also