Ch 4 — Derivatives & Gradients

The slope of the hill you’re standing on — the compass that guides every AI model
Calculus: Slope → Rules → Partial → Gradient → Jacobian → Hessian → Loss
The Slope of the Hill
A derivative tells you how steep the ground is under your feet
The Analogy
You’re standing on a hill. The derivative is the slope under your feet — it tells you “if I take one tiny step forward, how much higher or lower will I be?” Positive slope = going uphill. Negative slope = going downhill. Zero slope = you’re at a flat spot (maybe a peak, a valley, or a saddle point).
Key insight: Every AI model is trained by “walking downhill” on a loss landscape. The derivative tells the model which direction is downhill. Without derivatives, AI training would be like trying to find the lowest point in a dark room by randomly stumbling around.
The Math
The derivative of f(x) at point a is the instantaneous rate of change:
```python
# Derivative = limit of rise/run
# f'(a) = lim(h→0) [f(a+h) - f(a)] / h

# Example: f(x) = x²  →  f'(x) = 2x
# At x = 3:  slope = 2×3 = 6      (steep uphill)
# At x = 0:  slope = 2×0 = 0      (flat — minimum!)
# At x = -2: slope = 2×(-2) = -4  (downhill)

# Numerical approximation (forward difference)
def derivative(f, x, h=1e-7):
    return (f(x + h) - f(x)) / h

derivative(lambda x: x**2, 3)  # ≈ 6.0
```
Real World
Slope under your feet: +6 = steep uphill, 0 = flat, −4 = downhill
In AI
Gradient of loss: positive = increase weight makes loss worse, negative = makes it better
Derivative Rules — The Toolkit
Power rule, product rule, and the functions AI uses most
The Analogy
Derivative rules are like shortcut formulas so you don’t have to compute limits every time. The power rule says “bring the exponent down and subtract one.” The product rule handles two things multiplied together. These shortcuts let you compute slopes of arbitrarily complex functions in seconds.
Key insight: The derivative of eˣ is eˣ — up to a constant factor, it’s the only function that is its own derivative. This is why the exponential function appears everywhere in AI: softmax, sigmoid, Gaussian distributions. It’s mathematically “clean” to differentiate.
Key Rules
```python
# Power rule:   d/dx[xⁿ] = n·xⁿ⁻¹
#   f(x) = x³  →  f'(x) = 3x²
# Exponential:  d/dx[eˣ] = eˣ   (its own derivative!)
# Log:          d/dx[ln(x)] = 1/x
# Product rule: d/dx[f·g] = f'g + fg'

# AI-critical derivatives:
# ReLU:    f(x) = max(0, x)
#          f'(x) = 1 if x > 0, else 0
# Sigmoid: f(x) = 1/(1+e⁻ˣ)
#          f'(x) = f(x)·(1-f(x))
# MSE loss: L = (y-ŷ)²
#          dL/dŷ = -2(y-ŷ)
```
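Each of these rules can be sanity-checked numerically. Here is a quick sketch (not part of the original chapter) that verifies them with a central-difference approximation, in the same spirit as the numerical `derivative` helper shown earlier:

```python
import math

def numderiv(f, x, h=1e-6):
    # Central difference: more accurate than the forward difference
    return (f(x + h) - f(x - h)) / (2 * h)

# Power rule: d/dx[x³] = 3x² → at x=2, slope = 12
assert abs(numderiv(lambda t: t**3, 2.0) - 12.0) < 1e-4

# Exponential is its own derivative: d/dx[eˣ] = eˣ
assert abs(numderiv(math.exp, 1.0) - math.exp(1.0)) < 1e-4

# Log: d/dx[ln x] = 1/x → at x=2, slope = 0.5
assert abs(numderiv(math.log, 2.0) - 0.5) < 1e-6

# Sigmoid identity: σ'(x) = σ(x)·(1 − σ(x))
sig = lambda t: 1 / (1 + math.exp(-t))
assert abs(numderiv(sig, 0.5) - sig(0.5) * (1 - sig(0.5))) < 1e-6
```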
Why ReLU won: Its derivative is either 0 or 1 — no multiplication, no saturation. Sigmoid’s derivative maxes out at 0.25, causing gradients to shrink (vanish) in deep networks. ReLU’s clean gradient is why it became the default activation.
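To make the vanishing-gradient point concrete, here is a small illustrative calculation (an added sketch, not from the original chapter): sigmoid’s gradient peaks at 0.25, so stacking many sigmoid layers multiplies the signal down toward zero, while ReLU’s gradient on active units is exactly 1.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Sigmoid's gradient peaks at x = 0:
print(sigmoid_grad(0.0))     # 0.25
# Ten stacked sigmoid layers shrink it to at most 0.25¹⁰:
print(0.25 ** 10)            # ≈ 9.5e-7 — effectively vanished
# ReLU passes the gradient through unchanged on active units:
print(relu_grad(3.0) ** 10)  # 1.0
```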
Partial Derivatives — One Knob at a Time
What happens when you have millions of variables?
The Analogy
Imagine a mixing board with 100 sliders controlling sound. A partial derivative answers: “If I nudge just this one slider while holding all others fixed, how does the output change?” You test one knob at a time. With 100 sliders, you get 100 partial derivatives — one per slider.
Key insight: A neural network with 175 billion parameters (like GPT-3) has 175 billion “sliders.” Training computes 175 billion partial derivatives every single step — one for each weight. That’s what backpropagation does: efficiently computes all those partial derivatives at once.
Worked Example
```python
# f(x, y) = x²y + 3y
# ∂f/∂x = 2xy      (treat y as constant)
# ∂f/∂y = x² + 3   (treat x as constant)

# At (x=2, y=5):
# ∂f/∂x = 2×2×5 = 20
# ∂f/∂y = 4 + 3 = 7

# AI example: loss L(w₁, w₂)
# ∂L/∂w₁ = how loss changes if w₁ is nudged
# ∂L/∂w₂ = how loss changes if w₂ is nudged

import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)
f = x**2 * y + 3 * y
f.backward()
x.grad  # tensor(20.) — ∂f/∂x
y.grad  # tensor(7.)  — ∂f/∂y
```
The Gradient — A Compass Pointing Uphill
Collect all partial derivatives into one direction vector
The Analogy
Each partial derivative tells you the slope in one direction. The gradient bundles them all into a single vector that points in the direction of steepest ascent — like a compass that always points uphill. In AI, we want to go downhill (minimize loss), so we walk in the opposite direction of the gradient: −∇L.
Key insight: The gradient is the single most important concept in AI training. Every optimizer (SGD, Adam, AdaGrad) is just a different strategy for following the negative gradient downhill. The gradient IS the training signal.
Worked Example
```python
import torch

# Gradient = vector of all partial derivatives
# ∇f = [∂f/∂x, ∂f/∂y, ∂f/∂z, ...]

# f(x,y) = x² + y²  (a bowl)
# ∇f = [2x, 2y]
# At (3, 4): ∇f = [6, 8]
# Points uphill → walk opposite: [-6, -8]

# Gradient descent update rule:
# w_new = w_old - learning_rate × ∇L(w_old)
lr = 0.01
w = torch.tensor([3.0, 4.0], requires_grad=True)
loss = (w**2).sum()  # x² + y² = 25
loss.backward()
w.grad  # tensor([6., 8.]) — the gradient
# w_new = [3-0.06, 4-0.08] = [2.94, 3.92]
```
Formula: ∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]. It points in the direction of steepest increase, and its magnitude is the steepness in that direction.
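Repeating the update rule drives the weights to the bottom of the bowl. Here is a small added sketch (not from the original chapter) that runs many steps of the same update on f(x, y) = x² + y²:

```python
import torch

# Follow −∇f downhill on the bowl f(x, y) = x² + y²
w = torch.tensor([3.0, 4.0], requires_grad=True)
lr = 0.1

for _ in range(50):
    f = (w ** 2).sum()
    f.backward()                # compute ∇f
    with torch.no_grad():
        w -= lr * w.grad        # step opposite the gradient
    w.grad.zero_()              # reset for the next iteration

print(w.tolist())  # ≈ [0.0, 0.0] — the bottom of the bowl
```

Each step multiplies w by (1 − 2·lr) = 0.8, so the weights shrink geometrically toward the minimum at the origin.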
Directional Derivatives
The slope in any direction you choose
The Analogy
The gradient tells you the steepest direction. But what if you want to know the slope in a specific direction — say, northeast? The directional derivative answers: “How steep is the hill if I walk in direction u?” It’s the dot product of the gradient with your chosen direction: D_u f = ∇f · u.
Key insight: The directional derivative is maximized when you walk in the gradient direction (steepest ascent) and minimized when you walk opposite (steepest descent). Walking perpendicular to the gradient gives zero slope — you’re traversing a contour line, like walking along the edge of a hill without going up or down.
Worked Example
```python
import numpy as np

# f(x,y) = x² + y²
# ∇f at (3,4) = [6, 8]
# Direction: northeast = [1/√2, 1/√2]
grad = np.array([6, 8])
u = np.array([1, 1]) / np.sqrt(2)

# Directional derivative = ∇f · u
D_u = np.dot(grad, u)  # ≈ 9.9

# Compare with the gradient direction:
grad_dir = grad / np.linalg.norm(grad)
D_max = np.dot(grad, grad_dir)  # 10.0

# Gradient direction is steepest! (10 > 9.9)
```
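The contour-line claim can be checked the same way: rotating the gradient by 90° gives a perpendicular direction, and the directional derivative along it is zero. A small added sketch:

```python
import numpy as np

# Walking perpendicular to the gradient: zero slope (a contour line)
grad = np.array([6.0, 8.0])         # ∇f at (3, 4) for f = x² + y²
perp = np.array([-8.0, 6.0])        # gradient rotated 90°
perp = perp / np.linalg.norm(perp)  # unit direction

D_perp = np.dot(grad, perp)
print(D_perp)  # ≈ 0 — no ascent, no descent
```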
Real World
Hiker asks: “How steep is it if I walk northeast?”
In AI
Gradient direction = steepest descent for loss minimization
The Jacobian — Gradients for Vector Outputs
When your function outputs multiple values
The Analogy
A gradient works when you have one output (like loss). But what if your function has multiple outputs? Imagine a weather model that predicts both temperature and humidity from pressure and wind speed. The Jacobian is a matrix where each row is the gradient of one output. It’s a “gradient for each output, stacked together.”
Key insight: In a neural network, each layer maps a vector to a vector. The Jacobian of that layer tells you how each output changes with each input. Backpropagation multiplies Jacobians together layer by layer — that’s the chain rule in matrix form (Chapter 5).
Worked Example
```python
import torch

# f: R² → R²  (2 inputs, 2 outputs)
# f₁(x,y) = x²y     f₂(x,y) = x + y³

# Jacobian J = [[∂f₁/∂x, ∂f₁/∂y],
#               [∂f₂/∂x, ∂f₂/∂y]]
#           = [[2xy, x² ],
#              [1,   3y²]]

# At (x=2, y=3):
# J = [[12, 4],
#      [1, 27]]

# PyTorch's backward() computes Jacobian-vector
# products (not the full Jacobian — too expensive!)
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.stack([x[0]**2 * x[1], x[0] + x[1]**3])
```
Shape: For f: Rⁿ → Rᵐ, the Jacobian is m×n. Each row i is the gradient of output i. For a layer with 512 inputs and 256 outputs, J is 256×512.
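For small functions like this one, the full Jacobian can still be materialized to check the hand computation. A sketch using PyTorch’s `torch.autograd.functional.jacobian` utility (fine here; impractical at network scale):

```python
import torch
from torch.autograd.functional import jacobian

def f(v):
    x, y = v
    return torch.stack([x**2 * y, x + y**3])

J = jacobian(f, torch.tensor([2.0, 3.0]))
print(J)
# tensor([[12.,  4.],
#         [ 1., 27.]])
```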
The Hessian — Curvature of the Landscape
Is the hill curving like a bowl or a saddle?
The Analogy
The gradient tells you the slope. The Hessian tells you the curvature — is the hill curving like a bowl (minimum), a dome (maximum), or a saddle (up in one direction, down in another)? It’s the matrix of second derivatives: how the slope itself is changing.
Key insight: In high-dimensional loss landscapes, most critical points are saddle points, not local minima. The Hessian’s eigenvalues tell you: all positive = bowl (minimum), all negative = dome (maximum), mixed = saddle. Research shows neural network loss surfaces have exponentially more saddle points than minima.
Worked Example
```python
# f(x,y) = x² - y²  (saddle function)
# ∇f = [2x, -2y]

# Hessian H = [[∂²f/∂x²,  ∂²f/∂x∂y],
#              [∂²f/∂y∂x, ∂²f/∂y² ]]
#           = [[2,  0],
#              [0, -2]]

# Eigenvalues of H: +2 and -2
# Mixed signs → SADDLE POINT!

# Bowl: f(x,y) = x² + y²
# H = [[2,0],[0,2]], eigenvalues: +2, +2
# All positive → MINIMUM ✓

# Second-order optimizers (L-BFGS, Newton)
# use Hessian info for better steps,
# but computing H is O(n²) — too expensive
# for billions of parameters.
```
Computational cost: For a model with n parameters, the Hessian is n×n. GPT-3 has 175B params — the Hessian would be 175B × 175B. That’s why first-order methods (SGD, Adam) dominate: they only need the gradient, not the Hessian.
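At toy scale, the eigenvalue test is easy to run. Here is a small added sketch using PyTorch’s `torch.autograd.functional.hessian` utility to classify the saddle function above:

```python
import torch
from torch.autograd.functional import hessian

def saddle(v):
    x, y = v
    return x**2 - y**2

# Hessian at the critical point (0, 0)
H = hessian(saddle, torch.tensor([0.0, 0.0]))
eig = torch.linalg.eigvalsh(H)  # eigenvalues in ascending order
print(eig)  # tensor([-2., 2.]) — mixed signs → saddle point
```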
The Loss Landscape — Where It All Comes Together
Visualizing the terrain every AI model navigates
The Analogy
The loss landscape is a vast mountain range where every point represents a set of model weights and the elevation is the loss (error). Training = hiking to the lowest valley. The gradient is your compass. The learning rate is your step size. The landscape has peaks, valleys, ridges, saddle points, and flat plateaus — all in millions of dimensions.
Why it matters for AI: Research by Li et al. (2018) showed that loss landscapes of neural networks can be visualized and that wider minima (flat valleys) generalize better than sharp minima (narrow valleys). This insight drives techniques like learning rate warmup, weight decay, and stochastic weight averaging.
The Training Loop
```python
import torch

# The fundamental training loop
model = MyNeuralNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for batch in dataloader:
    # 1. Forward pass: compute loss
    loss = loss_fn(model(batch.x), batch.y)
    # 2. Backward pass: compute gradients
    loss.backward()        # ∇L for ALL params
    # 3. Update: step downhill
    optimizer.step()       # w -= lr × ∇L
    optimizer.zero_grad()  # reset gradients
```
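The loop above is schematic (`MyNeuralNet`, `loss_fn`, and `dataloader` are placeholders). Here is a runnable miniature of the same pattern, an added sketch that fits y = 2x with a single weight and MSE loss:

```python
import torch

# Miniature training loop: learn w so that w·x ≈ 2x
w = torch.tensor([0.0], requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)

xs = torch.tensor([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs  # target: slope 2

for step in range(500):
    loss = ((w * xs - ys) ** 2).mean()  # 1. forward: MSE loss
    loss.backward()                     # 2. backward: dL/dw
    optimizer.step()                    # 3. update: walk downhill
    optimizer.zero_grad()               # reset gradient

print(w.item())  # ≈ 2.0 — the valley floor
```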
Real World
Hiker with compass (gradient) and step size (learning rate) seeking the valley
In AI
loss.backward() computes the gradient, optimizer.step() walks downhill