Ch 6 — Optimization & Gradient Descent

Blindfolded on a mountain — feeling the slope and stepping downhill
Blindfolded on a Mountain
Gradient descent: feel the slope, step downhill, repeat
The Analogy
You’re blindfolded on a mountain and need to reach the valley. You can’t see, but you can feel the slope under your feet. Strategy: feel which direction goes downhill, take a step that way, repeat. That’s gradient descent. The gradient tells you the slope. You step in the opposite direction (downhill). The learning rate is your step size.
Key insight: Every AI model ever trained uses some form of this algorithm. GPT-4, Stable Diffusion, AlphaFold — all of them are blindfolded hikers feeling the slope and stepping downhill, billions of times, until they find a good valley.
The Algorithm
# Vanilla gradient descent
# w = w - lr × ∇L(w)
w = torch.randn(100, requires_grad=True)
lr = 0.01

for step in range(1000):
    loss = compute_loss(w)    # elevation
    loss.backward()           # feel slope
    with torch.no_grad():
        w -= lr * w.grad      # step downhill
        w.grad.zero_()        # reset compass
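The same loop can be run end to end as a dependency-free sketch: minimize f(w) = (w − 3)², whose gradient is f′(w) = 2(w − 3), so the valley floor sits at w = 3. The function name and starting point here are illustrative choices, not from the chapter.

```python
def gradient_descent(lr=0.1, steps=100):
    w = 0.0                    # start somewhere on the mountain
    for _ in range(steps):
        grad = 2 * (w - 3)     # feel the slope: f'(w) = 2(w - 3)
        w -= lr * grad         # step downhill
    return w

w = gradient_descent()
# w ends up very close to 3, the bottom of the valley
```

Each step multiplies the distance to the minimum by (1 − 2·lr) = 0.8, so 100 steps shrink the initial error of 3 by a factor of about 10⁻¹⁰.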
Real World
Blindfolded hiker: feel slope, step downhill, repeat
In AI
Compute gradient, update weights, repeat for millions of steps
Learning Rate — Step Size Matters
Too big and you overshoot; too small and it takes forever to get there.
The Analogy
The learning rate is your step size on the mountain. Too large: you leap over the valley and end up on the opposite slope, bouncing back and forth (divergence). Too small: you take baby steps and it takes forever to reach the valley. Just right: you descend efficiently and settle into the minimum.
Key insight: The learning rate is often the single most important hyperparameter in deep learning. A 10× change in learning rate can mean the difference between a model that converges beautifully and one that completely fails to train. This is why learning rate schedules (warmup, cosine decay) are so important.
Worked Example
# f(x) = x² — minimum at x=0
# gradient: f'(x) = 2x

# Too large (lr=1.1): DIVERGES
x = 5.0
# x = 5 - 1.1×10 = -6.0  (overshot!)
# x = -6 - 1.1×(-12) = 7.2  (worse!)

# Too small (lr=0.001): SLOW
x = 5.0
# x = 5 - 0.001×10 = 4.99
# After 1000 steps: x ≈ 0.67 (still far)

# Just right (lr=0.1): CONVERGES
x = 5.0
# x = 5 - 0.1×10 = 4.0
# x = 4 - 0.1×8 = 3.2
# After 50 steps: x ≈ 0.0 ✓
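The three regimes can be checked numerically with a small sketch (same f(x) = x², gradient 2x; the helper name is illustrative):

```python
def descend(lr, steps, x=5.0):
    # Each step: x ← x - lr·2x = x·(1 - 2·lr)
    for _ in range(steps):
        x -= lr * 2 * x
    return x

diverged = descend(1.1, 10)      # |x| grows: multiplied by -1.2 each step
slow = descend(0.001, 1000)      # ≈ 0.67, still far after 1000 steps
good = descend(0.1, 50)          # ≈ 0, converged
```

The update is a pure geometric sequence here, so the fate of the run is decided entirely by whether |1 − 2·lr| is below 1.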
Too Large
Leap over the valley, bounce wildly, diverge
Just Right
Steady descent, efficient convergence to minimum
Stochastic Gradient Descent (SGD)
Using a noisy compass instead of a perfect one
The Analogy
Computing the gradient on the entire dataset is like surveying the entire mountain before each step — accurate but impossibly slow. SGD uses a small random sample (mini-batch) to estimate the gradient. It’s like feeling the slope with just your left foot instead of mapping the whole hillside. The estimate is noisy but fast, and the noise actually helps escape bad local minima.
Key insight: The “stochastic” noise in SGD is a feature, not a bug. It helps the optimizer escape sharp, narrow minima (which generalize poorly) and settle into wide, flat minima (which generalize well). This is why SGD often finds better solutions than exact gradient descent.
Worked Example
# Full gradient: average over ALL data
# ∇L = (1/N) Σ ∇Lᵢ — expensive for N=1M

# Mini-batch SGD: average over batch of 32
# ∇L ≈ (1/32) Σ ∇Lᵢ — noisy but fast!
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# DataLoader gives random mini-batches
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    loss = model(batch).loss
    loss.backward()         # noisy gradient
    optimizer.step()        # noisy step
    optimizer.zero_grad()
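A pure-Python sketch of the same idea, with toy data assumed: the loss L(w) = meanᵢ (w − xᵢ)² is minimized at the mean of the data, and each step uses a noisy gradient estimate from a random 32-point batch.

```python
import random

random.seed(0)
data = [random.gauss(4.0, 1.0) for _ in range(1000)]   # toy dataset

def minibatch_grad(w, batch):
    # Gradient of the batch loss: (2/|B|) Σ (w - xᵢ)
    return 2 * sum(w - x for x in batch) / len(batch)

w, lr = 0.0, 0.05
for step in range(500):
    batch = random.sample(data, 32)     # random mini-batch of 32
    w -= lr * minibatch_grad(w, batch)  # noisy step
# w wanders noisily around mean(data) ≈ 4 instead of landing exactly on it
```

Each step only looks at 32 of the 1000 points, so it is roughly 30× cheaper than the full gradient, at the cost of a small random wobble around the minimum.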
Real World
Survey the whole mountain (slow, accurate) vs. feel with one foot (fast, noisy)
In AI
Full batch (exact gradient, slow) vs. mini-batch (noisy gradient, fast, better generalization)
Momentum — A Ball Rolling Downhill
Build speed in consistent directions, dampen oscillations
The Analogy
Vanilla SGD is like a hiker who stops and re-evaluates at every step. Momentum turns the hiker into a ball rolling downhill. The ball builds speed in directions it consistently rolls (the true downhill direction) and naturally dampens oscillations (the noisy side-to-side wobble). It uses a weighted average of past gradients, not just the current one.
Key insight: Without momentum, SGD zigzags in narrow valleys (imagine a bowling alley — the gradient points sideways toward the walls, not down the alley). Momentum smooths this out by accumulating velocity in the consistent direction (down the alley), making convergence much faster.
Worked Example
# Momentum: v = β×v + ∇L;  w = w - lr×v
# β = 0.9 (typical) — keep 90% of the old velocity
v = 0        # initial velocity
beta = 0.9   # momentum coefficient
lr = 0.01

for step in range(100):
    g = compute_gradient(w)
    v = beta * v + g    # accumulate velocity
    w = w - lr * v      # update weights

# PyTorch equivalent:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
Intuition: β = 0.9 means the velocity is a weighted average of roughly the last 10 gradients (1/(1−0.9) = 10). This smooths out noise while preserving the consistent signal.
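The bowling-alley picture can be tested on a toy narrow valley, f(x, y) = x² + 25·y² (steep toward the walls in y, gentle down the alley in x). This is a hedged pure-Python sketch, not PyTorch's implementation; setting beta=0 recovers vanilla gradient descent.

```python
def run(steps, beta=0.0, lr=0.02):
    x, y = 5.0, 1.0          # start up the alley, off to one side
    vx = vy = 0.0            # velocity
    for _ in range(steps):
        gx, gy = 2 * x, 50 * y       # ∇f = (2x, 50y)
        vx = beta * vx + gx          # accumulate velocity
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return x, y

x, y = run(300, beta=0.9)    # momentum: settles near the minimum (0, 0)
x0, y0 = run(300, beta=0.0)  # vanilla GD at the same lr, for comparison
```

The steep y direction forces a small learning rate on vanilla GD, which then crawls along the gentle x direction; momentum accumulates speed down the alley and damps the sideways component, so it reaches the minimum faster at the same lr.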
Adam — The Default Optimizer
Momentum + adaptive learning rates per parameter
The Analogy
Adam is like a smart ball that not only builds momentum but also adjusts its step size per dimension. In a narrow canyon, it takes big steps along the floor (where gradients are small and consistent) and small steps toward the walls (where gradients are large and noisy). It maintains two running averages: the mean of gradients (momentum) and the variance of gradients (adaptive scaling).
Key insight: Adam divides each parameter’s update by the square root of its gradient variance. Parameters with consistently large gradients get smaller steps (they’re already moving fast). Parameters with small gradients get larger steps (they need a boost). This per-parameter adaptation is why Adam works so well out of the box.
The Algorithm
# Adam: Adaptive Moment Estimation — Kingma & Ba (2014)
# m = β₁×m + (1-β₁)×g      (mean)
# v = β₂×v + (1-β₂)×g²     (variance)
# m̂ = m / (1-β₁ᵗ)          (bias correction)
# v̂ = v / (1-β₂ᵗ)          (bias correction)
# w = w - lr × m̂ / (√v̂ + ε)
# Defaults: β₁=0.9, β₂=0.999, ε=1e-8
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,              # typical default
    betas=(0.9, 0.999),
    eps=1e-8,
)
Source: Kingma & Ba (2014) “Adam: A Method for Stochastic Optimization.” Default for transformers, LLMs, and most modern architectures. Recent research (2025) shows β₁ = β₂ can improve performance via gradient scale invariance.
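The update equations above can be written out from scratch in a few lines. This is a hedged sketch (function name and toy objective are illustrative): it minimizes f(w) = (w − 3)², gradient 2(w − 3), using exactly the m, v, bias-correction, and scaled-step formulas listed.

```python
def adam_minimize(lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 2 * (w - 3)                        # gradient of (w - 3)²
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running variance of gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps) # adaptive, per-parameter step
    return w

w = adam_minimize()   # settles near 3, the minimum
```

Because the step is m̂/√v̂, its magnitude is roughly lr regardless of how large the raw gradient is; that scale invariance is what makes the lr=0.001 default usable across so many models.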
Convex vs. Non-Convex Landscapes
A bowl vs. a mountain range full of traps
The Analogy
A convex loss surface is like a bowl — there’s exactly one valley, and no matter where you start, gradient descent will find it. A non-convex surface is like a mountain range — multiple valleys, ridges, plateaus, and saddle points. Neural network loss surfaces are highly non-convex, yet gradient descent still works remarkably well.
Key insight: In high dimensions, local minima are rare but saddle points are everywhere. A saddle point is flat in some directions but curved in others — like a mountain pass. SGD’s noise helps it escape saddle points by randomly nudging the optimizer in a direction that curves downward.
Key Concepts
# Convex: f(x) = x² (one minimum)
# → Gradient descent always finds it

# Non-convex: neural network loss
# → Many local minima, saddle points
# → But most local minima are "good enough"

# Saddle point: ∇L = 0 but NOT a minimum
# f(x, y) = x² - y²
# At (0, 0): gradient = 0
# But it's a min in x, max in y → saddle

# In 1000D, a random critical point has
# ~500 upward and ~500 downward directions
# → almost certainly a saddle, not a minimum
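The saddle f(x, y) = x² − y² makes a small runnable demo of why noise helps. At the exact saddle the gradient is (0, 0), so plain gradient descent takes zero-length steps forever; an SGD-style random nudge breaks the tie, and the iterate then falls away along the downward-curving y direction. The nudge size here is an illustrative choice.

```python
import random

def grad(x, y):
    # f(x, y) = x² - y²  →  ∇f = (2x, -2y)
    return 2 * x, -2 * y

# Exactly on the saddle: gradient is (0, 0), plain GD is stuck.
gx, gy = grad(0.0, 0.0)   # (0.0, 0.0)

# A tiny random nudge (stand-in for SGD noise) breaks the symmetry.
random.seed(1)
x = 1e-6 * random.gauss(0, 1)
y = 1e-6 * random.gauss(0, 1)
lr = 0.1
for _ in range(200):
    gx, gy = grad(x, y)
    x -= lr * gx   # x shrinks toward 0 (curved-up direction)
    y -= lr * gy   # |y| grows each step (curved-down direction)
```

Along x each step multiplies the coordinate by 0.8; along y by 1.2. Any nonzero perturbation in y is amplified exponentially, which is exactly how SGD noise escapes saddles.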
Convex
A bowl: one valley, guaranteed to find it
Non-Convex
Mountain range: many valleys, saddle points, but SGD noise helps navigate
Common Traps & Solutions
Plateaus, exploding gradients, and learning rate schedules
The Traps
Plateaus: flat regions where the gradient is near zero — the hiker stands still.
Exploding gradients: gradients grow exponentially through layers, causing wild weight updates.
Vanishing gradients: gradients shrink to zero, so early layers stop learning.
Sharp minima: narrow valleys that overfit — good on training data, bad on test data.
Warning: Exploding gradients can produce NaN losses in a single step. Gradient clipping (capping the gradient norm) is essential for training RNNs and transformers. Most modern training pipelines include torch.nn.utils.clip_grad_norm_ as standard practice.
Solutions
# Gradient clipping: cap the gradient norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning rate warmup + cosine decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Weight decay (L2 regularization)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=0.01,   # penalizes large weights → wider minima
)
Learning rate warmup: Start with a tiny lr, ramp up over the first ~1000 steps, then decay. This prevents early instability when the model’s initial random weights produce wild gradients.
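The warmup-then-cosine-decay shape can be written as a single pure function. This is a sketch of the common recipe (function name and step counts are illustrative, not a PyTorch API): linear ramp for the first `warmup` steps, then cosine decay toward zero at `total`.

```python
import math

def lr_at(step, base_lr=3e-4, warmup=1000, total=10000):
    if step < warmup:
        return base_lr * (step + 1) / warmup           # linear warmup
    progress = (step - warmup) / (total - warmup)      # 0 → 1 after warmup
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

lr_at(0)      # tiny: cautious first steps
lr_at(1000)   # base_lr: full speed after warmup
lr_at(9999)   # near zero: settle into the valley
```

In practice the same shape is wrapped in a scheduler object that calls `optimizer.step()`-adjacent `scheduler.step()` each iteration, as the training loops in this chapter do.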
The Complete Training Recipe
Putting it all together: optimizer + schedule + clipping
The Analogy
Training a neural network is like a carefully planned expedition. You start cautiously (warmup), build speed (full learning rate), and slow down as you approach the valley (cosine decay). You carry safety gear (gradient clipping) and a map of past terrain (momentum). The expedition takes millions of steps, but each one follows the same simple loop: compute loss, compute gradient, update weights.
Why it matters for AI: The training loop below is essentially what trains GPT-4, Stable Diffusion, and every other modern AI model. The math is the same — only the scale changes. Understanding this loop means understanding the engine behind all of AI.
Production Training Loop
model = TransformerModel()
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.1,
    betas=(0.9, 0.95),
)
# CosineWithWarmup: a user-defined warmup + cosine schedule
# (not a built-in PyTorch class)
scheduler = CosineWithWarmup(optimizer, warmup=2000, total=100000)

for step, batch in enumerate(loader):
    loss = model(batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
Real World
Expedition: start cautious, build speed, slow near destination, carry safety gear
In AI
Warmup → full lr → cosine decay, with gradient clipping and weight decay