Ch 6 — Optimization & Gradient Descent

Blindfolded on a mountain — feeling the slope and stepping downhill
Blindfolded on a Mountain
Gradient descent: feel the slope, step downhill, repeat
The Analogy
You’re blindfolded on a mountain and need to reach the valley. You can’t see, but you can feel the slope under your feet. Strategy: feel which direction goes downhill, take a step that way, repeat. That’s gradient descent. The gradient tells you the slope. You step in the opposite direction (downhill). The learning rate is your step size.
Key insight: Every AI model ever trained uses some form of this algorithm. GPT-4, Stable Diffusion, AlphaFold — all of them are blindfolded hikers feeling the slope and stepping downhill, billions of times, until they find a good valley.
The Algorithm
# Vanilla gradient descent
# w = w - lr × ∇L(w)
w = torch.randn(100, requires_grad=True)
lr = 0.01

for step in range(1000):
    loss = compute_loss(w)    # elevation
    loss.backward()           # feel slope
    with torch.no_grad():
        w -= lr * w.grad      # step downhill
        w.grad.zero_()        # reset compass
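The same loop can be run end to end as a dependency-free sketch: minimize f(w) = (w − 3)², whose gradient is f′(w) = 2(w − 3), so the valley floor sits at w = 3. The function name and starting point here are illustrative choices, not from the chapter.

```python
def gradient_descent(lr=0.1, steps=100):
    w = 0.0                    # start somewhere on the mountain
    for _ in range(steps):
        grad = 2 * (w - 3)     # feel the slope: f'(w) = 2(w - 3)
        w -= lr * grad         # step downhill
    return w

w = gradient_descent()
# w ends up very close to 3, the bottom of the valley
```

Each step multiplies the distance to the minimum by (1 − 2·lr) = 0.8, so 100 steps shrink the initial error of 3 by a factor of about 10⁻¹⁰.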
Real World
Blindfolded hiker: feel slope, step downhill, repeat
In AI
Compute gradient, update weights, repeat for millions of steps
Learning Rate — Step Size Matters
Too big and you overshoot; too small and it takes forever to get there.
The Analogy
The learning rate is your step size on the mountain. Too large: you leap over the valley and end up on the opposite slope, bouncing back and forth (divergence). Too small: you take baby steps and it takes forever to reach the valley. Just right: you descend efficiently and settle into the minimum.
Key insight: The learning rate is often the single most important hyperparameter in deep learning. A 10× change in learning rate can mean the difference between a model that converges beautifully and one that completely fails to train. This is why learning rate schedules (warmup, cosine decay) are so important.
Worked Example
# f(x) = x² — minimum at x=0
# gradient: f'(x) = 2x

# Too large (lr=1.1): DIVERGES
x = 5.0
# x = 5 - 1.1×10 = -6.0  (overshot!)
# x = -6 - 1.1×(-12) = 7.2  (worse!)

# Too small (lr=0.001): SLOW
x = 5.0
# x = 5 - 0.001×10 = 4.99
# After 1000 steps: x ≈ 0.67 (still far)

# Just right (lr=0.1): CONVERGES
x = 5.0
# x = 5 - 0.1×10 = 4.0
# x = 4 - 0.1×8 = 3.2
# After 50 steps: x ≈ 0.0 ✓
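The three regimes can be checked numerically with a small sketch (same f(x) = x², gradient 2x; the helper name is illustrative):

```python
def descend(lr, steps, x=5.0):
    # Each step: x ← x - lr·2x = x·(1 - 2·lr)
    for _ in range(steps):
        x -= lr * 2 * x
    return x

diverged = descend(1.1, 10)      # |x| grows: multiplied by -1.2 each step
slow = descend(0.001, 1000)      # ≈ 0.67, still far after 1000 steps
good = descend(0.1, 50)          # ≈ 0, converged
```

The update is a pure geometric sequence here, so the fate of the run is decided entirely by whether |1 − 2·lr| is below 1.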
Too Large
Leap over the valley, bounce wildly, diverge
Just Right
Steady descent, efficient convergence to minimum
Stochastic Gradient Descent (SGD)
Using a noisy compass instead of a perfect one
The Analogy
Computing the gradient on the entire dataset is like surveying the entire mountain before each step — accurate but impossibly slow. SGD uses a small random sample (mini-batch) to estimate the gradient. It’s like feeling the slope with just your left foot instead of mapping the whole hillside. The estimate is noisy but fast, and the noise actually helps escape bad local minima.
Key insight: The “stochastic” noise in SGD is a feature, not a bug. It helps the optimizer escape sharp, narrow minima (which generalize poorly) and settle into wide, flat minima (which generalize well). This is why SGD often finds better solutions than exact gradient descent.
Worked Example
# Full gradient: average over ALL data
# ∇L = (1/N) Σ ∇Lᵢ — expensive for N=1M

# Mini-batch SGD: average over batch of 32
# ∇L ≈ (1/32) Σ ∇Lᵢ — noisy but fast!
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# DataLoader gives random mini-batches
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    loss = model(batch).loss
    loss.backward()         # noisy gradient
    optimizer.step()        # noisy step
    optimizer.zero_grad()
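A pure-Python sketch of the same idea, with toy data assumed: the loss L(w) = meanᵢ (w − xᵢ)² is minimized at the mean of the data, and each step uses a noisy gradient estimate from a random 32-point batch.

```python
import random

random.seed(0)
data = [random.gauss(4.0, 1.0) for _ in range(1000)]   # toy dataset

def minibatch_grad(w, batch):
    # Gradient of the batch loss: (2/|B|) Σ (w - xᵢ)
    return 2 * sum(w - x for x in batch) / len(batch)

w, lr = 0.0, 0.05
for step in range(500):
    batch = random.sample(data, 32)     # random mini-batch of 32
    w -= lr * minibatch_grad(w, batch)  # noisy step
# w wanders noisily around mean(data) ≈ 4 instead of landing exactly on it
```

Each step only looks at 32 of the 1000 points, so it is roughly 30× cheaper than the full gradient, at the cost of a small random wobble around the minimum.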
Real World
Survey the whole mountain (slow, accurate) vs. feel with one foot (fast, noisy)
In AI
Full batch (exact gradient, slow) vs. mini-batch (noisy gradient, fast, better generalization)
Momentum — A Ball Rolling Downhill
Build speed in consistent directions, dampen oscillations
The Analogy
Vanilla SGD is like a hiker who stops and re-evaluates at every step. Momentum turns the hiker into a ball rolling downhill. The ball builds speed in directions it consistently rolls (the true downhill direction) and naturally dampens oscillations (the noisy side-to-side wobble). It uses a weighted average of past gradients, not just the current one.
Key insight: Without momentum, SGD zigzags in narrow valleys (imagine a bowling alley — the gradient points sideways toward the walls, not down the alley). Momentum smooths this out by accumulating velocity in the consistent direction (down the alley), making convergence much faster.
Worked Example
# Momentum: v = β×v + ∇L;  w = w - lr×v
# β = 0.9 (typical) — keep 90% of the old velocity
v = 0        # initial velocity
beta = 0.9   # momentum coefficient
lr = 0.01

for step in range(100):
    g = compute_gradient(w)
    v = beta * v + g    # accumulate velocity
    w = w - lr * v      # update weights

# PyTorch equivalent:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
Intuition: β = 0.9 means the velocity is a weighted average of roughly the last 10 gradients (1/(1−0.9) = 10). This smooths out noise while preserving the consistent signal.
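The bowling-alley picture can be tested on a toy narrow valley, f(x, y) = x² + 25·y² (steep toward the walls in y, gentle down the alley in x). This is a hedged pure-Python sketch, not PyTorch's implementation; setting beta=0 recovers vanilla gradient descent.

```python
def run(steps, beta=0.0, lr=0.02):
    x, y = 5.0, 1.0          # start up the alley, off to one side
    vx = vy = 0.0            # velocity
    for _ in range(steps):
        gx, gy = 2 * x, 50 * y       # ∇f = (2x, 50y)
        vx = beta * vx + gx          # accumulate velocity
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return x, y

x, y = run(300, beta=0.9)    # momentum: settles near the minimum (0, 0)
x0, y0 = run(300, beta=0.0)  # vanilla GD at the same lr, for comparison
```

The steep y direction forces a small learning rate on vanilla GD, which then crawls along the gentle x direction; momentum accumulates speed down the alley and damps the sideways component, so it reaches the minimum faster at the same lr.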
Adam — The Default Optimizer
Momentum + adaptive learning rates per parameter
The Analogy
Adam is like a smart ball that not only builds momentum but also adjusts its step size per dimension. In a narrow canyon, it takes big steps along the floor (where gradients are small and consistent) and small steps toward the walls (where gradients are large and noisy). It maintains two running averages: the mean of gradients (momentum) and the variance of gradients (adaptive scaling).
Key insight: Adam divides each parameter’s update by the square root of its gradient variance. Parameters with consistently large gradients get smaller steps (they’re already moving fast). Parameters with small gradients get larger steps (they need a boost). This per-parameter adaptation is why Adam works so well out of the box.
The Algorithm
# Adam: Adaptive Moment Estimation — Kingma & Ba (2014)
# m = β₁×m + (1-β₁)×g      (mean)
# v = β₂×v + (1-β₂)×g²     (variance)
# m̂ = m / (1-β₁ᵗ)          (bias correction)
# v̂ = v / (1-β₂ᵗ)          (bias correction)
# w = w - lr × m̂ / (√v̂ + ε)
# Defaults: β₁=0.9, β₂=0.999, ε=1e-8
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,              # typical default
    betas=(0.9, 0.999),
    eps=1e-8,
)
Source: Kingma & Ba (2014) “Adam: A Method for Stochastic Optimization.” Default for transformers, LLMs, and most modern architectures. Recent research (2025) shows β₁ = β₂ can improve performance via gradient scale invariance.
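The update equations above can be written out from scratch in a few lines. This is a hedged sketch (function name and toy objective are illustrative): it minimizes f(w) = (w − 3)², gradient 2(w − 3), using exactly the m, v, bias-correction, and scaled-step formulas listed.

```python
def adam_minimize(lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 3)                        # gradient of (w - 3)²
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running variance of gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps) # adaptive, per-parameter step
    return w

w = adam_minimize()   # settles near 3, the minimum
```

Because the step is m̂/√v̂, its magnitude is roughly lr regardless of how large the raw gradient is; that scale invariance is what makes the lr=0.001 default usable across so many models.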
Convex vs. Non-Convex Landscapes
A bowl vs. a mountain range full of traps
The Analogy
A convex loss surface is like a bowl — there’s exactly one valley, and no matter where you start, gradient descent will find it. A non-convex surface is like a mountain range — multiple valleys, ridges, plateaus, and saddle points. Neural network loss surfaces are highly non-convex, yet gradient descent still works remarkably well.
Key insight: In high dimensions, local minima are rare but saddle points are everywhere. A saddle point is flat in some directions but curved in others — like a mountain pass. SGD’s noise helps it escape saddle points by randomly nudging the optimizer in a direction that curves downward.
Key Concepts
# Convex: f(x) = x² (one minimum)
# → Gradient descent always finds it

# Non-convex: neural network loss
# → Many local minima, saddle points
# → But most local minima are "good enough"

# Saddle point: ∇L = 0 but NOT a minimum
# f(x, y) = x² - y²
# At (0, 0): gradient = 0
# But it's a min in x, max in y → saddle

# In 1000D, a random critical point has
# ~500 upward and ~500 downward directions
# → almost certainly a saddle, not a minimum
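The saddle f(x, y) = x² − y² makes a small runnable demo of why noise helps. At the exact saddle the gradient is (0, 0), so plain gradient descent takes zero-length steps forever; an SGD-style random nudge breaks the tie, and the iterate then falls away along the downward-curving y direction. The nudge size here is an illustrative choice.

```python
import random

def grad(x, y):
    # f(x, y) = x² - y²  →  ∇f = (2x, -2y)
    return 2 * x, -2 * y

# Exactly on the saddle: gradient is (0, 0), plain GD is stuck.
gx, gy = grad(0.0, 0.0)   # (0.0, 0.0)

# A tiny random nudge (stand-in for SGD noise) breaks the symmetry.
random.seed(1)
x = 1e-6 * random.gauss(0, 1)
y = 1e-6 * random.gauss(0, 1)
lr = 0.1
for _ in range(200):
    gx, gy = grad(x, y)
    x -= lr * gx   # x shrinks toward 0 (curved-up direction)
    y -= lr * gy   # |y| grows each step (curved-down direction)
```

Along x each step multiplies the coordinate by 0.8; along y by 1.2. Any nonzero perturbation in y is amplified exponentially, which is exactly how SGD noise escapes saddles.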
Convex
A bowl: one valley, guaranteed to find it
Non-Convex
Mountain range: many valleys, saddle points, but SGD noise helps navigate
Common Traps & Solutions
Plateaus, exploding gradients, and learning rate schedules
The Traps
Plateaus: flat regions where the gradient is near zero — the hiker stands still.
Exploding gradients: gradients grow exponentially through layers, causing wild weight updates.
Vanishing gradients: gradients shrink to zero, so early layers stop learning.
Sharp minima: narrow valleys that overfit — good on training data, bad on test data.
Warning: Exploding gradients can produce NaN losses in a single step. Gradient clipping (capping the gradient norm) is essential for training RNNs and transformers. Most modern training pipelines include torch.nn.utils.clip_grad_norm_ as standard practice.
Solutions
# Gradient clipping: cap the gradient norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning rate warmup + cosine decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Weight decay (L2 regularization)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01,   # penalizes large weights → wider minima
)
Learning rate warmup: Start with a tiny lr, ramp up over the first ~1000 steps, then decay. This prevents early instability when the model’s initial random weights produce wild gradients.
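The warmup-then-cosine-decay shape can be written as a single pure function. This is a sketch of the common recipe (function name and step counts are illustrative, not a PyTorch API): linear ramp for the first `warmup` steps, then cosine decay toward zero at `total`.

```python
import math

def lr_at(step, base_lr=3e-4, warmup=1000, total=10000):
    if step < warmup:
        return base_lr * (step + 1) / warmup           # linear warmup
    progress = (step - warmup) / (total - warmup)      # 0 → 1 after warmup
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

lr_at(0)      # tiny: cautious first steps
lr_at(1000)   # base_lr: full speed after warmup
lr_at(9999)   # near zero: settle into the valley
```

In practice the same shape is wrapped in a scheduler object that calls `optimizer.step()`-adjacent `scheduler.step()` each iteration, as the training loops in this chapter do.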
The Complete Training Recipe
Putting it all together: optimizer + schedule + clipping
The Analogy
Training a neural network is like a carefully planned expedition. You start cautiously (warmup), build speed (full learning rate), and slow down as you approach the valley (cosine decay). You carry safety gear (gradient clipping) and a map of past terrain (momentum). The expedition takes millions of steps, but each one follows the same simple loop: compute loss, compute gradient, update weights.
Why it matters for AI: The training loop below is essentially what trains GPT-4, Stable Diffusion, and every other modern AI model. The math is the same — only the scale changes. Understanding this loop means understanding the engine behind all of AI.
Production Training Loop
model = TransformerModel()
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)
# CosineWithWarmup: a user-defined warmup + cosine schedule
# (not a built-in PyTorch class)
scheduler = CosineWithWarmup(optimizer, warmup=2000, total=100000)

for step, batch in enumerate(loader):
    loss = model(batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
Real World
Expedition: start cautious, build speed, slow near destination, carry safety gear
In AI
Warmup → full lr → cosine decay, with gradient clipping and weight decay