Ch 3 — Optimizers & Learning Rates

SGD, momentum, RMSProp, Adam, and learning rate schedules
Vanilla SGD & Its Limits
The simplest optimizer and why it struggles
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) updates weights using the gradient from a single mini-batch rather than the full dataset. It’s fast per step but noisy — each batch gives a slightly different gradient estimate. This noise can help escape shallow local minima but also causes the loss to oscillate rather than smoothly decrease. SGD treats all parameters equally and has no memory of previous gradients.
The Update
// Vanilla SGD
w = w - η · ∂L/∂w

// Problems:
// 1. Same learning rate for ALL parameters
// 2. Oscillates in steep dimensions
// 3. Slow in flat dimensions
// 4. Gets stuck in saddle points
Key insight: In high-dimensional loss landscapes (millions of parameters), saddle points are far more common than local minima. SGD can stall at saddle points because the gradient is near zero in many directions.
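To make the update rule concrete, here is vanilla SGD minimizing the one-dimensional loss L(w) = w², whose gradient is 2w. This is an illustrative sketch in plain Python; the function names are made up for this example, not taken from any library.

```python
# Illustrative sketch: vanilla SGD on L(w) = w^2, whose gradient is 2w.
# The names sgd_step and minimize are invented for this example.

def sgd_step(w, grad, lr):
    """One vanilla SGD update: w <- w - lr * grad."""
    return w - lr * grad

def minimize(w0, lr=0.1, steps=50):
    w = w0
    for _ in range(steps):
        w = sgd_step(w, 2.0 * w, lr)  # dL/dw = 2w for L = w^2
    return w
```

With lr=0.1 each step multiplies w by 0.8, so the iterate shrinks geometrically toward the minimum at 0; on a real mini-batch loss the gradient would be a noisy estimate rather than this exact value.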
Momentum
Adding inertia to gradient descent
The Physics Analogy
Momentum (Polyak, 1964) adds a velocity term that accumulates past gradients. Think of a ball rolling downhill: it builds speed in consistent directions and dampens oscillations in noisy directions. The momentum coefficient β (typically 0.9) controls how much history to retain. With momentum, SGD accelerates through flat regions and smooths out zigzag paths in narrow valleys.
The Update
// SGD with Momentum
v = β · v + ∂L/∂w      // accumulate velocity
w = w - η · v          // update weights

// β = 0.9 means 90% of previous velocity
// is carried forward

// Nesterov variant (1983): look ahead first
v = β · v + ∂L/∂(w - β·v)
w = w - η · v
Key insight: Nesterov momentum (Nesterov, 1983) computes the gradient at the “look-ahead” position rather than the current position. This gives a corrective signal that reduces overshooting and converges faster in practice.
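Both variants can be sketched on the same toy quadratic L(w) = w². This follows the update rules above directly; the hyperparameters and function names are illustrative choices, not canonical values.

```python
# Sketch of heavy-ball momentum and the Nesterov look-ahead variant on
# L(w) = w^2 (gradient 2w). Names and hyperparameters are illustrative.

def momentum_minimize(w0, lr=0.05, beta=0.9, steps=300):
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + 2.0 * w   # accumulate velocity from the gradient
        w = w - lr * v           # step along the velocity
    return w

def nesterov_minimize(w0, lr=0.05, beta=0.9, steps=300):
    w, v = w0, 0.0
    for _ in range(steps):
        g = 2.0 * (w - beta * v)  # gradient at the look-ahead point
        v = beta * v + g
        w = w - lr * v
    return w
```

The only difference is where the gradient is evaluated: at the current weights (heavy ball) or at the projected position w - β·v (Nesterov).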
Adaptive Learning Rates: AdaGrad & RMSProp
Different rates for different parameters
AdaGrad (Duchi et al., 2011)
AdaGrad gives each parameter its own learning rate based on its gradient history. Parameters with large past gradients (frequent features) get smaller steps; parameters with small past gradients (rare features) get larger steps. This is great for sparse data (NLP, recommendations) but has a fatal flaw: the accumulated squared gradients grow monotonically, eventually shrinking the learning rate to near zero.
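The "fatal flaw" is easy to demonstrate: feed AdaGrad a constant gradient and the accumulator grows as t·g², so the effective step shrinks like 1/√t and never recovers. A minimal sketch (function name invented for this example):

```python
import math

# Illustrative sketch of AdaGrad's decaying step size: with a constant
# gradient g, the cumulative sum of squared gradients grows without
# bound, so the effective step size shrinks like lr / sqrt(t).

def adagrad_step_sizes(g=1.0, lr=0.1, eps=1e-8, n=4):
    acc, sizes = 0.0, []
    for _ in range(n):
        acc += g * g                           # cumulative sum, never shrinks
        sizes.append(lr * g / (math.sqrt(acc) + eps))
    return sizes
```

With g = 1 and lr = 0.1, the successive step sizes are 0.1, 0.1/√2, 0.1/√3, 0.05, ... monotonically toward zero even though the loss landscape has not changed.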
RMSProp (Hinton, 2012)
RMSProp, proposed by Geoffrey Hinton in his Coursera lectures (2012, unpublished), fixes AdaGrad by using an exponential moving average of squared gradients instead of a cumulative sum. This prevents the learning rate from decaying to zero and adapts to the recent gradient landscape.
The Updates
// RMSProp
s = ρ · s + (1-ρ) · (∂L/∂w)²    // EMA of sq grads
w = w - η · ∂L/∂w / (√s + ε)    // adaptive step

// ρ = 0.99, ε = 1e-8 (stability)
// Large recent grads → smaller step
// Small recent grads → larger step
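The same update can be written as a runnable sketch on L(w) = w². Unlike AdaGrad, the denominator tracks only recent squared gradients, so the step size does not decay to zero; the function name is illustrative.

```python
import math

# Sketch of RMSProp on L(w) = w^2 (gradient 2w), mirroring the update
# rule above. Because s is an exponential moving average rather than a
# cumulative sum, steps stay usefully large throughout training.

def rmsprop_minimize(w0, lr=0.01, rho=0.99, eps=1e-8, steps=2000):
    w, s = w0, 0.0
    for _ in range(steps):
        g = 2.0 * w
        s = rho * s + (1 - rho) * g * g       # EMA of squared gradients
        w = w - lr * g / (math.sqrt(s) + eps)
    return w
```

Near the minimum, g/√s is roughly ±1, so RMSProp behaves like sign-SGD with step size η and settles into a small band around the optimum rather than converging exactly.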
Adam: The Default Optimizer
Combining momentum and adaptive rates
Kingma & Ba (ICLR 2015)
Adam (Adaptive Moment Estimation) combines the best of momentum and RMSProp. It maintains two running averages: m (first moment — mean of gradients, like momentum) and v (second moment — mean of squared gradients, like RMSProp). It also includes bias correction to account for the fact that m and v are initialized at zero and are biased toward zero in early steps. Adam is the most widely used optimizer in deep learning.
Rule of thumb: Adam with lr=3e-4 is a strong default for most tasks. It was the default optimizer for training GPT-2, GPT-3, and many other large language models.
The Algorithm
// Adam (Kingma & Ba, 2015)
m = β₁ · m + (1-β₁) · g      // 1st moment
v = β₂ · v + (1-β₂) · g²     // 2nd moment
m̂ = m / (1 - β₁ᵗ)            // bias correct
v̂ = v / (1 - β₂ᵗ)            // bias correct
w = w - η · m̂ / (√v̂ + ε)     // update

// Defaults: β₁=0.9, β₂=0.999, ε=1e-8
// t = timestep (for bias correction)
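The algorithm translates line for line into a runnable sketch on L(w) = w² (function name illustrative). Note how the bias correction makes the very first step well-scaled even though m and v start at zero:

```python
import math

# Runnable sketch of Adam on L(w) = w^2, mirroring the pseudocode above
# step for step with the standard defaults.

def adam_minimize(w0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=3000):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2.0 * w
        m = b1 * m + (1 - b1) * g             # 1st moment (momentum-like)
        v = b2 * v + (1 - b2) * g * g         # 2nd moment (RMSProp-like)
        m_hat = m / (1 - b1 ** t)             # bias correction
        v_hat = v / (1 - b2 ** t)             # bias correction
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w
```

At t = 1, the corrections divide out the (1-β) factors exactly, so m̂ = g and √v̂ = |g|, and the first step has magnitude ≈ η regardless of the gradient's scale.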
Adam Variants
AdamW, RAdam, and the weight decay fix
AdamW (Loshchilov & Hutter, 2019)
Original Adam applies L2 regularization by adding the penalty's gradient (λ·w) to the loss gradient, but this interacts poorly with adaptive learning rates: the decay term gets rescaled by the same adaptive denominator as the gradient. AdamW (ICLR 2019) decouples weight decay from the gradient update, applying it directly to the weights. This seemingly small change significantly improves generalization. AdamW is the standard optimizer for training transformers, including BERT, GPT-3, and LLaMA.
Key Variants
// AdamW: decoupled weight decay
w = w - η · (m̂/(√v̂ + ε) + λ·w)

// vs. Adam with L2 (wrong way)
g = g + λ·w                // decay in gradient
w = w - η · m̂/(√v̂ + ε)

// RAdam (Liu et al., 2020)
// Rectifies variance in early training
// Removes need for warmup heuristic
Key insight: The difference between Adam and AdamW looks trivial in code but has a profound effect. AdamW ensures that weight decay penalizes large weights equally regardless of the adaptive learning rate, leading to better-regularized models.
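One way to see the decoupling is to feed a step a zero loss gradient: the moments stay at zero, so only the decay term acts, shrinking w by a factor (1 - η·λ) per step independent of the adaptive denominator. A minimal sketch with illustrative names:

```python
import math

# Minimal sketch of one AdamW step with decoupled weight decay. With a
# zero loss gradient the moments stay zero, so only the lambda*w term
# acts and w shrinks geometrically by (1 - lr*wd) per step.

def adamw_step(w, g, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decay is applied directly to w, not mixed into the gradient:
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

Under Adam-with-L2, the λ·w term would instead be divided by √v̂ + ε, so parameters with large gradient history would be decayed less, which is exactly the coupling AdamW removes.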
Learning Rate Schedules
Changing the learning rate during training
Why Schedule?
A fixed learning rate is rarely optimal. Early in training, you want large steps to make fast progress. Later, you want small steps to fine-tune. Learning rate schedules systematically reduce η over time. The most common strategies: step decay (halve every N epochs), cosine annealing (smooth cosine curve from max to min), and warmup + decay (start small, ramp up, then decay).
Common Schedules
// Step decay
η = η₀ · 0.1^(epoch // 30)

// Cosine annealing (Loshchilov & Hutter, 2017)
η = η_min + 0.5·(η_max - η_min)·(1 + cos(π·t/T))

// Linear warmup + cosine decay (standard for LLMs)
if t < warmup_steps:
    η = η_max · (t / warmup_steps)
else:
    η = cosine_decay(t)
Key insight: Warmup is critical for Adam-family optimizers. Liu et al. (2020) showed that Adam has problematically large variance in early training steps. Warmup acts as a variance reduction technique, stabilizing the adaptive learning rate estimates.
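The warmup-plus-cosine schedule above can be written as a single function of the step counter; the argument names here are illustrative.

```python
import math

# Sketch of the linear-warmup + cosine-decay schedule: ramp linearly
# from 0 to eta_max over warmup_steps, then follow a cosine curve down
# to eta_min over the remaining steps.

def lr_at(step, total_steps, warmup_steps, eta_max, eta_min=0.0):
    if step < warmup_steps:
        return eta_max * step / warmup_steps          # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```

The schedule peaks at exactly η_max when warmup ends (cos(0) = 1) and reaches η_min at the final step (cos(π) = -1), with the midpoint of the decay sitting at (η_max + η_min)/2.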
Optimizer Comparison
When to use which optimizer
Decision Guide
AdamW is the default for transformers and most modern architectures. SGD + momentum often generalizes better for CNNs (ResNets on ImageNet) but requires more tuning. Adam converges faster but can generalize worse than SGD in some settings. For LLM training, AdamW with linear warmup + cosine decay is the near-universal standard (used by GPT-3, LLaMA, Gemini).
AdamW (Transformers / LLMs)
Fast convergence, minimal tuning, decoupled weight decay. Standard for BERT, GPT, LLaMA, and most NLP/LLM work.
SGD + Momentum (Vision)
Often better final accuracy on image tasks. Requires careful LR tuning. Used for ResNet, EfficientNet training.
Quick Reference
# PyTorch optimizer setup
import torch.optim as optim

# For transformers / LLMs
optimizer = optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01,
)

# For CNNs / vision
optimizer = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
)
Practical Tips & What’s Next
Battle-tested advice for choosing and tuning optimizers
Practical Guidelines
1. Start with AdamW (lr=3e-4, wd=0.01) as your baseline.
2. Use linear warmup for the first 5–10% of training steps.
3. Use cosine annealing to decay to ~10% of peak LR.
4. If training is unstable, reduce the LR or increase warmup.
5. For fine-tuning pretrained models, use a 10–100× smaller LR than pretraining.
6. Monitor both training and validation loss: a widening gap between them signals overfitting.
The connection: Optimizers determine how efficiently a network learns. With training mechanics covered (loss, backprop, optimizers), the next chapter shifts to architecture: Convolutional Neural Networks — the breakthrough that taught machines to see.
The Evolution
// The optimizer family tree
SGD (1951)
├─ + Momentum (1964, Polyak)
│  └─ Nesterov (1983)
├─ AdaGrad (2011, Duchi)
│  └─ RMSProp (2012, Hinton)
│     └─ Adam (2015, Kingma & Ba)
│        ├─ AdamW (2019, Loshchilov)
│        └─ RAdam (2020, Liu)
└─ LAMB, LARS   // for large-batch training