Ch 5 — Chain Rule & Backpropagation

The blame chain — tracing errors backward through a neural network
The Blame Chain
Bad meal? Blame the chef. Chef blames the supplier. Supplier blames the farmer.
The Analogy
You eat a terrible meal at a restaurant. You blame the chef. The chef says “the ingredients were bad” and blames the supplier. The supplier says “the crops were poor” and blames the farmer. Each person passes blame backward to whoever gave them their input. Backpropagation does exactly this: it traces blame (error) backward through the network, telling each weight how much it contributed to the final mistake.
Key insight: The “blame” each weight receives IS its gradient. A weight that contributed a lot to the error gets a large gradient (lots of blame) and gets adjusted more. A weight that barely affected the output gets a tiny gradient and barely changes.
The Blame Chain in a Network
# Neural network as a blame chain:
#
# Input → [Layer 1] → [Layer 2] → [Layer 3] → Loss
#   x   →  w₁·x+b₁  →  w₂·h₁+b₂ →  w₃·h₂+b₃ →  L
#
# Backprop traces blame backward:
# Loss → How much did Layer 3 cause this?
#      → How much did Layer 2 cause this?
#      → How much did Layer 1 cause this?
#
# Each layer's "blame" = its gradient
# ∂L/∂w₃, ∂L/∂w₂, ∂L/∂w₁
Real World
Bad meal → blame chef → blame supplier → blame farmer
In AI
High loss → blame layer 3 → blame layer 2 → blame layer 1
The Chain Rule — Multiplying Blame
If A affects B, and B affects C, then A affects C through B
The Analogy
If each extra bag of fertilizer adds two bushels of crop yield, and each extra bushel adds three points of restaurant quality, then each extra bag of fertilizer adds 2 × 3 = 6 quality points. Rates of change multiply through the chain. That’s the chain rule: dC/dA = dC/dB × dB/dA.
Key insight: Backpropagation IS the chain rule. Nothing more, nothing less. It’s just the chain rule applied systematically to a computational graph. Every “deep learning breakthrough” in training ultimately relies on this one calculus rule from the 1600s.
Worked Example
# Chain rule: df/dx = df/du × du/dx
# Example: f(x) = (3x + 2)²
# Let u = 3x + 2, then f = u²
# df/du = 2u
# du/dx = 3
# df/dx = 2u × 3 = 6(3x + 2)
# At x = 1: u = 5
# df/dx = 6 × 5 = 30

# Multi-step chain:
# dL/dw₁ = dL/dh₃ × dh₃/dh₂ × dh₂/dh₁ × dh₁/dw₁
# Each × is one link in the blame chain
Formula: For f(g(h(x))): df/dx = df/dg × dg/dh × dh/dx. Each factor is one “link” in the chain. More layers = more multiplications.
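The worked example above is easy to sanity-check numerically. A plain-Python sketch (the function names and finite-difference step are ours):

```python
# f(x) = (3x + 2)², analytic derivative df/dx = 6(3x + 2)

def f(x):
    return (3 * x + 2) ** 2

def analytic_grad(x):
    return 6 * (3 * x + 2)  # chain rule: 2u × 3 with u = 3x + 2

# Central finite difference: (f(x+h) - f(x-h)) / 2h
h = 1e-6
x = 1.0
numeric = (f(x + h) - f(x - h)) / (2 * h)

print(analytic_grad(x))   # 30
print(round(numeric, 3))  # ≈ 30.0
```

Comparing an analytic gradient against a finite difference like this is the standard way to catch chain-rule mistakes in hand-derived backprop code.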
Computational Graphs
Drawing the recipe so we can trace blame backward
The Analogy
A computational graph is like a recipe flowchart. Each node is an operation (add, multiply, square). Edges show what feeds into what. To trace blame backward, you follow the arrows in reverse, multiplying local derivatives at each step. PyTorch builds this graph automatically during the forward pass.
Key insight: PyTorch’s requires_grad=True tells the system “record every operation on this tensor.” During the forward pass, PyTorch secretly builds a DAG (directed acyclic graph) of all operations. When you call .backward(), it walks this graph in reverse, applying the chain rule at every node.
Worked Example
# Computational graph for: L = (w×x + b)²
#
# w ──┐
#     ├── [×] ── z ──┐
# x ──┘              ├── [+] ── h ── [²] ── L
# b ─────────────────┘
#
# Forward: z = w×x, h = z+b, L = h²
# Backward (chain rule):
# dL/dh = 2h
# dL/db = dL/dh × dh/db = 2h × 1 = 2h
# dL/dz = dL/dh × dh/dz = 2h × 1 = 2h
# dL/dw = dL/dz × dz/dw = 2h × x
Real World
Recipe flowchart: ingredients → steps → dish
In AI
Computational graph: inputs → operations → loss
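The graph walk can be sketched with a toy node class (our own minimal code, not PyTorch internals; it assumes each node feeds into only one consumer, whereas real autograd handles arbitrary DAGs):

```python
# Each node stores its value, its parent nodes, and the local derivative
# d(self)/d(parent) for each parent — the "local blame" rule.

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes that fed into this one
        self.local_grads = local_grads  # d(self)/d(parent) per parent
        self.grad = 0.0                 # accumulated blame

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def square(a):
    return Node(a.value ** 2, (a,), (2 * a.value,))

def backward(loss):
    loss.grad = 1.0  # dL/dL = 1
    stack = [loss]
    while stack:
        node = stack.pop()
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local  # chain rule at this edge
            stack.append(parent)

# L = (w×x + b)² with w=2, x=3, b=1
w, x, b = Node(2.0), Node(3.0), Node(1.0)
L = square(add(mul(w, x), b))
backward(L)
print(w.grad, b.grad)  # 42.0 14.0
```

Note how each operation records its local derivatives during the forward pass; the backward walk just multiplies and accumulates them, exactly the recipe-flowchart-in-reverse described above.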
Forward Pass — Computing the Output
Run the recipe from inputs to loss
The Analogy
The forward pass is cooking the meal: take ingredients (input), follow the recipe (network operations), produce the dish (output), and taste it (compute loss). During this process, PyTorch saves every intermediate result — like a security camera recording every step of the cooking process, so you can review the footage later to figure out what went wrong.
Key insight: The forward pass must save intermediate values (activations) because the backward pass needs them. This is why training uses ~2–3× more memory than inference — you’re storing the “security footage” for backpropagation. Gradient checkpointing trades compute for memory by recomputing instead of storing.
Worked Example with Numbers
# Forward pass: L = (w×x + b)²
# w=2, x=3, b=1
w, x, b = 2, 3, 1

# Step 1: z = w × x = 2 × 3 = 6
z = w * x   # 6 (saved!)

# Step 2: h = z + b = 6 + 1 = 7
h = z + b   # 7 (saved!)

# Step 3: L = h² = 7² = 49
L = h ** 2  # 49 (the loss)

# All intermediates saved for the backward pass
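The "saved!" intermediates are exactly what gradient checkpointing trades away. A toy contrast in plain Python (not PyTorch's actual torch.utils.checkpoint API, just the idea):

```python
w, x, b = 2.0, 3.0, 1.0

# Standard backprop: store activations during the forward pass
z = w * x
h = z + b
saved = {"z": z, "h": h}     # memory cost grows with network depth

# Checkpointed: store nothing, recompute when the backward pass needs it
def forward_to_h(w, x, b):
    return w * x + b         # extra compute, no stored activations

h_recomputed = forward_to_h(w, x, b)
dL_dh = 2 * h_recomputed     # same gradient either way
print(dL_dh)  # 14.0
```

Either way the gradient is identical; checkpointing simply pays a second forward computation instead of holding the "security footage" in memory.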
Backward Pass — Tracing Blame
Walk the graph in reverse, multiplying local gradients
The Analogy
Now you review the security footage in reverse. Start at the loss and ask: “How much did the last step contribute?” Then: “How much did the step before that contribute?” At each node, you compute the local derivative and multiply it with the incoming blame. By the time you reach the weights, each one knows exactly how much it’s to blame.
Key insight: The backward pass visits each node exactly once, computing one local derivative and one multiplication. For a network with N operations, backprop costs roughly 2× the forward pass — NOT N× more. This efficiency is why deep learning is practical at all.
Worked Example with Numbers
# Backward pass (continuing from forward)
# w=2, x=3, b=1, z=6, h=7, L=49

# Start: dL/dL = 1 (trivially)

# Step 3 backward: L = h²
dL_dh = 2 * h        # 2×7 = 14

# Step 2 backward: h = z + b
dL_dz = dL_dh * 1    # 14 × 1 = 14
dL_db = dL_dh * 1    # 14 × 1 = 14

# Step 1 backward: z = w × x
dL_dw = dL_dz * x    # 14 × 3 = 42
dL_dx = dL_dz * w    # 14 × 2 = 28

# Result: ∂L/∂w = 42, ∂L/∂b = 14
# w gets 3× more blame than b!
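These hand-computed gradients can be verified with finite differences (a plain-Python sketch; the helper name is ours):

```python
# Nudge each parameter slightly and watch how the loss responds.
def loss(w, x, b):
    return (w * x + b) ** 2

eps = 1e-6
w, x, b = 2.0, 3.0, 1.0
dL_dw = (loss(w + eps, x, b) - loss(w - eps, x, b)) / (2 * eps)
dL_db = (loss(w, x, b + eps) - loss(w, x, b - eps)) / (2 * eps)
print(round(dL_dw, 3), round(dL_db, 3))  # ≈ 42.0 14.0
```

The numerical estimates match ∂L/∂w = 42 and ∂L/∂b = 14 from the chain-rule derivation.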
Backprop Through a Real Layer
A complete worked example with a 2-layer network
The Setup
Let’s trace backprop through a tiny 2-layer network with ReLU activation. Input x = 2, target y = 1, MSE loss. This is the exact process happening billions of times during training.
# 2-layer network: x → w₁ → ReLU → w₂ → loss
x = 2.0; y = 1.0
w1 = 0.5; w2 = -0.3

# Forward:
z1 = w1 * x        # 0.5 × 2 = 1.0
a1 = max(0, z1)    # ReLU(1.0) = 1.0
z2 = w2 * a1       # -0.3 × 1.0 = -0.3
L = (z2 - y)**2    # (-0.3 - 1)² = 1.69
Backward Pass
# Backward:
dL_dz2 = 2*(z2 - y)      # 2 × (-1.3) = -2.6
dL_dw2 = dL_dz2 * a1     # -2.6 × 1.0 = -2.6
dL_da1 = dL_dz2 * w2     # -2.6 × -0.3 = 0.78
dL_dz1 = dL_da1 * 1      # 0.78 (ReLU grad = 1, since z1 > 0)
dL_dw1 = dL_dz1 * x      # 0.78 × 2 = 1.56

# Update (lr = 0.1):
w1 -= 0.1 * 1.56         # 0.5 → 0.344
w2 -= 0.1 * (-2.6)       # -0.3 → -0.04
# w2 moved toward positive (reducing error)
Notice: w2 got a negative gradient (−2.6), so it increased (moved toward positive), which makes the output closer to the target y = 1. The math automatically figures out which direction each weight should move.
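Re-running the forward pass with the updated weights confirms the step helped (a plain-Python sketch of the example above):

```python
x, y = 2.0, 1.0
w1, w2 = 0.5, -0.3

def forward(w1, w2):
    z1 = w1 * x
    a1 = max(0.0, z1)    # ReLU
    z2 = w2 * a1
    return (z2 - y) ** 2

loss_before = forward(w1, w2)       # 1.69

# One gradient step, using the gradients derived above
w1_new = w1 - 0.1 * 1.56            # 0.5  → 0.344
w2_new = w2 - 0.1 * (-2.6)          # -0.3 → -0.04
loss_after = forward(w1_new, w2_new)

print(round(loss_before, 4), "→", round(loss_after, 4))  # 1.69 → ≈ 1.0558
```

One step of blame-guided adjustment cut the loss by more than a third; training repeats this loop until the loss stops improving.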
Forward vs. Reverse Mode Autodiff
Why backprop is reverse mode — and why it’s efficient
The Analogy
Forward mode: For each input, trace its influence forward through every output. If you have 1 million inputs and 1 output, that’s 1 million forward passes. Reverse mode: For each output, trace blame backward through all inputs. With 1 output (loss) and 1 million inputs (weights), that’s just 1 backward pass. Neural networks have many inputs (weights) and one output (loss), so reverse mode wins massively.
Key insight: This is why backpropagation (reverse mode) was revolutionary. Computing all 175 billion gradients of GPT-3 costs roughly 2–3× one forward pass. Forward mode would cost 175 billion forward passes. Reverse mode made deep learning computationally feasible.
Complexity Comparison
# Forward mode autodiff:
#   Cost = O(n × forward_pass)
#   where n = number of input parameters
#   GPT-3: 175B × forward_pass 😱

# Reverse mode autodiff (backprop):
#   Cost = O(m × forward_pass)
#   where m = number of outputs (usually 1)
#   GPT-3: 1 × forward_pass × ~2–3 ✓

# Forward mode wins when:
#   few inputs, many outputs (rare in ML)
# Reverse mode wins when:
#   many inputs, few outputs (always in ML!)
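Forward mode is easy to sketch with dual numbers, which carry a value and a derivative through the computation together (a toy class of our own, not a library API):

```python
class Dual:
    """Carries (value, derivative) through the forward pass."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __mul__(self, other):           # product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    def __add__(self, other):           # sum rule
        return Dual(self.value + other.value, self.deriv + other.deriv)

# L = (w×x + b)²; seeding w with deriv=1 yields dL/dw — and ONLY dL/dw
w = Dual(2.0, 1.0)   # "watch" this input
x = Dual(3.0)
b = Dual(1.0)
h = w * x + b
L = h * h
print(L.deriv)  # 42.0
```

Getting dL/db too would require a second full pass with b seeded instead, which is exactly why forward mode scales badly with the number of parameters.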
Forward Mode
Ask each ingredient: “how do you affect the final dish?” (1M questions)
Reverse Mode
Ask the dish: “which ingredients caused this?” (1 question, all answers)
PyTorch Autograd — Automatic Backprop
You write the forward pass; PyTorch handles the backward pass
The Analogy
PyTorch’s autograd is like having a robotic accountant who watches you cook, records every step, and automatically calculates exactly how much each ingredient contributed to the final taste. You just cook (write the forward pass). The accountant handles all the blame-tracing (backward pass) automatically.
Why it matters for AI: Before autograd, researchers had to manually derive and implement gradients for every new architecture. PyTorch’s dynamic computational graph means you can use Python control flow (if/else, loops) and autograd still works. This is why PyTorch became the dominant research framework.
In Practice
import torch

# Same 2-layer example, but PyTorch does it
x = torch.tensor(2.0)
y = torch.tensor(1.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)

# Forward (PyTorch records the graph)
z1 = w1 * x
a1 = torch.relu(z1)
z2 = w2 * a1
L = (z2 - y) ** 2

# Backward (one call does everything!)
L.backward()
w1.grad  # tensor(1.56) — same as manual!
w2.grad  # tensor(-2.60) — same as manual!
Source: Paszke et al. (2017) “Automatic differentiation in PyTorch” introduced the dynamic computational graph approach. PyTorch builds a new graph each forward pass, enabling Python-native control flow.