Ch 4 — Derivatives & Gradients

The slope of the hill you’re standing on — the compass that guides every AI model
Calculus: Slope → Rules → Partial → Gradient → Jacobian → Hessian → Loss
The Slope of the Hill
A derivative tells you how steep the ground is under your feet
The Analogy
You’re standing on a hill. The derivative is the slope under your feet — it tells you “if I take one tiny step forward, how much higher or lower will I be?” Positive slope = going uphill. Negative slope = going downhill. Zero slope = you’re at a flat spot (maybe a peak, a valley, or a saddle point).
Key insight: Every AI model is trained by “walking downhill” on a loss landscape. The derivative tells the model which direction is downhill. Without derivatives, AI training would be like trying to find the lowest point in a dark room by randomly stumbling around.
The Math
The derivative of f(x) at point a is the instantaneous rate of change:
```python
# Derivative = limit of rise/run
# f'(a) = lim(h→0) [f(a+h) - f(a)] / h

# Example: f(x) = x²  →  f'(x) = 2x
# At x = 3:  slope = 2×3 = 6      (steep uphill)
# At x = 0:  slope = 2×0 = 0      (flat — minimum!)
# At x = -2: slope = 2×(-2) = -4  (downhill)

# Numerical approximation (forward difference)
def derivative(f, x, h=1e-7):
    return (f(x + h) - f(x)) / h

derivative(lambda x: x**2, 3)  # ≈ 6.0
```
Real World
Slope under your feet: +6 = steep uphill, 0 = flat, −4 = downhill
In AI
Gradient of loss: positive = increase weight makes loss worse, negative = makes it better
Derivative Rules — The Toolkit
Power rule, product rule, and the functions AI uses most
The Analogy
Derivative rules are like shortcut formulas so you don’t have to compute limits every time. The power rule says “bring the exponent down and subtract one.” The product rule handles two things multiplied together. These shortcuts let you compute slopes of arbitrarily complex functions in seconds.
Key insight: The derivative of eˣ is eˣ — up to a constant factor, it’s the only function that is its own derivative. This is why the exponential function appears everywhere in AI: softmax, sigmoid, Gaussian distributions. It’s mathematically “clean” to differentiate.
Key Rules
```python
# Power rule:   d/dx[xⁿ] = n·xⁿ⁻¹
#   f(x) = x³  →  f'(x) = 3x²
# Exponential:  d/dx[eˣ] = eˣ   (its own derivative!)
# Log:          d/dx[ln(x)] = 1/x
# Product rule: d/dx[f·g] = f'g + fg'

# AI-critical derivatives:
# ReLU:    f(x) = max(0, x)
#          f'(x) = 1 if x > 0, else 0
# Sigmoid: f(x) = 1/(1+e⁻ˣ)
#          f'(x) = f(x)·(1-f(x))
# MSE loss: L = (y-ŷ)²
#          dL/dŷ = -2(y-ŷ)
```
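Each of these rules can be sanity-checked numerically. Here is a quick sketch (not part of the original chapter) that verifies them with a central-difference approximation, in the same spirit as the numerical `derivative` helper shown earlier:

```python
import math

def numderiv(f, x, h=1e-6):
    # Central difference: more accurate than the forward difference
    return (f(x + h) - f(x - h)) / (2 * h)

# Power rule: d/dx[x³] = 3x² → at x=2, slope = 12
assert abs(numderiv(lambda t: t**3, 2.0) - 12.0) < 1e-4

# Exponential is its own derivative: d/dx[eˣ] = eˣ
assert abs(numderiv(math.exp, 1.0) - math.exp(1.0)) < 1e-4

# Log: d/dx[ln x] = 1/x → at x=2, slope = 0.5
assert abs(numderiv(math.log, 2.0) - 0.5) < 1e-6

# Sigmoid identity: σ'(x) = σ(x)·(1 − σ(x))
sig = lambda t: 1 / (1 + math.exp(-t))
assert abs(numderiv(sig, 0.5) - sig(0.5) * (1 - sig(0.5))) < 1e-6
```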
Why ReLU won: Its derivative is either 0 or 1 — no multiplication, no saturation. Sigmoid’s derivative maxes out at 0.25, causing gradients to shrink (vanish) in deep networks. ReLU’s clean gradient is why it became the default activation.
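To make the vanishing-gradient point concrete, here is a small illustrative calculation (an added sketch, not from the original chapter): sigmoid’s gradient peaks at 0.25, so stacking many sigmoid layers multiplies the signal down toward zero, while ReLU’s gradient on active units is exactly 1.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Sigmoid's gradient peaks at x = 0:
print(sigmoid_grad(0.0))     # 0.25
# Ten stacked sigmoid layers shrink it to at most 0.25¹⁰:
print(0.25 ** 10)            # ≈ 9.5e-7 — effectively vanished
# ReLU passes the gradient through unchanged on active units:
print(relu_grad(3.0) ** 10)  # 1.0
```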
Partial Derivatives — One Knob at a Time
What happens when you have millions of variables?
The Analogy
Imagine a mixing board with 100 sliders controlling sound. A partial derivative answers: “If I nudge just this one slider while holding all others fixed, how does the output change?” You test one knob at a time. With 100 sliders, you get 100 partial derivatives — one per slider.
Key insight: A neural network with 175 billion parameters (like GPT-3) has 175 billion “sliders.” Training computes 175 billion partial derivatives every single step — one for each weight. That’s what backpropagation does: efficiently computes all those partial derivatives at once.
Worked Example
```python
# f(x, y) = x²y + 3y
# ∂f/∂x = 2xy      (treat y as constant)
# ∂f/∂y = x² + 3   (treat x as constant)

# At (x=2, y=5):
# ∂f/∂x = 2×2×5 = 20
# ∂f/∂y = 4 + 3 = 7

# AI example: loss L(w₁, w₂)
# ∂L/∂w₁ = how loss changes if w₁ is nudged
# ∂L/∂w₂ = how loss changes if w₂ is nudged

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
f = x**2 * y + 3 * y
f.backward()
x.grad  # tensor(20.) — ∂f/∂x
y.grad  # tensor(7.)  — ∂f/∂y
```
The Gradient — A Compass Pointing Uphill
Collect all partial derivatives into one direction vector
The Analogy
Each partial derivative tells you the slope in one direction. The gradient bundles them all into a single vector that points in the direction of steepest ascent — like a compass that always points uphill. In AI, we want to go downhill (minimize loss), so we walk in the opposite direction of the gradient: −∇L.
Key insight: The gradient is the single most important concept in AI training. Every optimizer (SGD, Adam, AdaGrad) is just a different strategy for following the negative gradient downhill. The gradient IS the training signal.
Worked Example
```python
import torch

# Gradient = vector of all partial derivatives
# ∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z, ...]

# f(x,y) = x² + y²  (a bowl)
# ∇f = [2x, 2y]
# At (3, 4): ∇f = [6, 8]
# Points uphill → walk opposite: [-6, -8]

# Gradient descent update rule:
# w_new = w_old - learning_rate × ∇L(w_old)
lr = 0.01
w = torch.tensor([3.0, 4.0], requires_grad=True)
loss = (w**2).sum()  # x² + y² = 25
loss.backward()
w.grad  # tensor([6., 8.]) — the gradient
# w_new = [3-0.06, 4-0.08] = [2.94, 3.92]
```
Formula: ∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]. It points in the direction of steepest increase, and its magnitude is the steepness in that direction.
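Repeating the update rule drives the weights to the bottom of the bowl. Here is a small added sketch (not from the original chapter) that runs many steps of the same update on f(x, y) = x² + y²:

```python
import torch

# Follow −∇f downhill on the bowl f(x, y) = x² + y²
w = torch.tensor([3.0, 4.0], requires_grad=True)
lr = 0.1

for _ in range(50):
    f = (w ** 2).sum()
    f.backward()                # compute ∇f
    with torch.no_grad():
        w -= lr * w.grad        # step opposite the gradient
    w.grad.zero_()              # reset for the next iteration

print(w.tolist())  # ≈ [0.0, 0.0] — the bottom of the bowl
```

Each step multiplies w by (1 − 2·lr) = 0.8, so the weights shrink geometrically toward the minimum at the origin.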
Directional Derivatives
The slope in any direction you choose
The Analogy
The gradient tells you the steepest direction. But what if you want to know the slope in a specific direction — say, northeast? The directional derivative answers: “How steep is the hill if I walk in direction u?” It’s the dot product of the gradient with your chosen direction: D_u f = ∇f · u.
Key insight: The directional derivative is maximized when you walk in the gradient direction (steepest ascent) and minimized when you walk opposite (steepest descent). Walking perpendicular to the gradient gives zero slope — you’re traversing a contour line, like walking along the edge of a hill without going up or down.
Worked Example
```python
import numpy as np

# f(x,y) = x² + y²
# ∇f at (3,4) = [6, 8]
# Direction: northeast = [1/√2, 1/√2]
grad = np.array([6, 8])
u = np.array([1, 1]) / np.sqrt(2)

# Directional derivative = ∇f · u
D_u = np.dot(grad, u)  # ≈ 9.9

# Compare with the gradient direction:
grad_dir = grad / np.linalg.norm(grad)
D_max = np.dot(grad, grad_dir)  # 10.0

# Gradient direction is steepest! (10 > 9.9)
```
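The contour-line claim can be checked the same way: rotating the gradient by 90° gives a perpendicular direction, and the directional derivative along it is zero. A small added sketch:

```python
import numpy as np

# Walking perpendicular to the gradient: zero slope (a contour line)
grad = np.array([6.0, 8.0])         # ∇f at (3, 4) for f = x² + y²
perp = np.array([-8.0, 6.0])        # gradient rotated 90°
perp = perp / np.linalg.norm(perp)  # unit direction

D_perp = np.dot(grad, perp)
print(D_perp)  # ≈ 0 — no ascent, no descent
```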
Real World
Hiker asks: “How steep is it if I walk northeast?”
In AI
Gradient direction = steepest descent for loss minimization
The Jacobian — Gradients for Vector Outputs
When your function outputs multiple values
The Analogy
A gradient works when you have one output (like loss). But what if your function has multiple outputs? Imagine a weather model that predicts both temperature and humidity from pressure and wind speed. The Jacobian is a matrix where each row is the gradient of one output. It’s a “gradient for each output, stacked together.”
Key insight: In a neural network, each layer maps a vector to a vector. The Jacobian of that layer tells you how each output changes with each input. Backpropagation multiplies Jacobians together layer by layer — that’s the chain rule in matrix form (Chapter 5).
Worked Example
```python
import torch

# f: R² → R²  (2 inputs, 2 outputs)
# f₁(x,y) = x²y     f₂(x,y) = x + y³

# Jacobian J = [[∂f₁/∂x, ∂f₁/∂y],
#               [∂f₂/∂x, ∂f₂/∂y]]
#           = [[2xy, x² ],
#              [1,   3y²]]

# At (x=2, y=3):
# J = [[12, 4],
#      [1, 27]]

# PyTorch's backward() computes Jacobian-vector
# products (not the full Jacobian — too expensive!)
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.stack([x[0]**2 * x[1], x[0] + x[1]**3])
```
Shape: For f: Rⁿ → Rᵐ, the Jacobian is m×n. Each row i is the gradient of output i. For a layer with 512 inputs and 256 outputs, J is 256×512.
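For small functions like this one, the full Jacobian can still be materialized to check the hand computation. A sketch using PyTorch’s `torch.autograd.functional.jacobian` utility (fine here; impractical at network scale):

```python
import torch
from torch.autograd.functional import jacobian

def f(v):
    x, y = v
    return torch.stack([x**2 * y, x + y**3])

J = jacobian(f, torch.tensor([2.0, 3.0]))
print(J)
# tensor([[12.,  4.],
#         [ 1., 27.]])
```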
The Hessian — Curvature of the Landscape
Is the hill curving like a bowl or a saddle?
The Analogy
The gradient tells you the slope. The Hessian tells you the curvature — is the hill curving like a bowl (minimum), a dome (maximum), or a saddle (up in one direction, down in another)? It’s the matrix of second derivatives: how the slope itself is changing.
Key insight: In high-dimensional loss landscapes, most critical points are saddle points, not local minima. The Hessian’s eigenvalues tell you: all positive = bowl (minimum), all negative = dome (maximum), mixed = saddle. Research shows neural network loss surfaces have exponentially more saddle points than minima.
Worked Example
```python
# f(x,y) = x² - y²  (saddle function)
# ∇f = [2x, -2y]

# Hessian H = [[∂²f/∂x²,  ∂²f/∂x∂y],
#              [∂²f/∂y∂x, ∂²f/∂y² ]]
#           = [[2,  0],
#              [0, -2]]

# Eigenvalues of H: +2 and -2
# Mixed signs → SADDLE POINT!

# Bowl: f(x,y) = x² + y²
# H = [[2,0],[0,2]], eigenvalues: +2, +2
# All positive → MINIMUM ✓

# Second-order optimizers (L-BFGS, Newton)
# use Hessian info for better steps,
# but computing H is O(n²) — too expensive
# for billions of parameters.
```
Computational cost: For a model with n parameters, the Hessian is n×n. GPT-3 has 175B params — the Hessian would be 175B × 175B. That’s why first-order methods (SGD, Adam) dominate: they only need the gradient, not the Hessian.
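At toy scale, the eigenvalue test is easy to run. Here is a small added sketch using PyTorch’s `torch.autograd.functional.hessian` utility to classify the saddle function above:

```python
import torch
from torch.autograd.functional import hessian

def saddle(v):
    x, y = v
    return x**2 - y**2

# Hessian at the critical point (0, 0)
H = hessian(saddle, torch.tensor([0.0, 0.0]))
eig = torch.linalg.eigvalsh(H)  # eigenvalues in ascending order
print(eig)  # tensor([-2., 2.]) — mixed signs → saddle point
```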
The Loss Landscape — Where It All Comes Together
Visualizing the terrain every AI model navigates
The Analogy
The loss landscape is a vast mountain range where every point represents a set of model weights and the elevation is the loss (error). Training = hiking to the lowest valley. The gradient is your compass. The learning rate is your step size. The landscape has peaks, valleys, ridges, saddle points, and flat plateaus — all in millions of dimensions.
Why it matters for AI: Research by Li et al. (2018) showed that loss landscapes of neural networks can be visualized and that wider minima (flat valleys) generalize better than sharp minima (narrow valleys). This insight drives techniques like learning rate warmup, weight decay, and stochastic weight averaging.
The Training Loop
```python
import torch

# The fundamental training loop
model = MyNeuralNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for batch in dataloader:
    # 1. Forward pass: compute loss
    loss = loss_fn(model(batch.x), batch.y)
    # 2. Backward pass: compute gradients
    loss.backward()        # ∇L for ALL params
    # 3. Update: step downhill
    optimizer.step()       # w -= lr × ∇L
    optimizer.zero_grad()  # reset gradients
```
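The loop above is schematic (`MyNeuralNet`, `loss_fn`, and `dataloader` are placeholders). Here is a runnable miniature of the same pattern, an added sketch that fits y = 2x with a single weight and MSE loss:

```python
import torch

# Miniature training loop: learn w so that w·x ≈ 2x
w = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)

xs = torch.tensor([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs  # target: slope 2

for step in range(500):
    loss = ((w * xs - ys) ** 2).mean()  # 1. forward: MSE loss
    loss.backward()                     # 2. backward: dL/dw
    optimizer.step()                    # 3. update: walk downhill
    optimizer.zero_grad()               # reset gradient

print(w.item())  # ≈ 2.0 — the valley floor
```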
Real World
Hiker with compass (gradient) and step size (learning rate) seeking the valley
In AI
loss.backward() computes the gradient, optimizer.step() walks downhill