Ch 5 — Chain Rule & Backpropagation

The blame chain — tracing errors backward through a neural network
The Blame Chain
Bad meal? Blame the chef. Chef blames the supplier. Supplier blames the farmer.
The Analogy
You eat a terrible meal at a restaurant. You blame the chef. The chef says “the ingredients were bad” and blames the supplier. The supplier says “the crops were poor” and blames the farmer. Each person passes blame backward to whoever gave them their input. Backpropagation does exactly this: it traces blame (error) backward through the network, telling each weight how much it contributed to the final mistake.
Key insight: The “blame” each weight receives IS its gradient. A weight that contributed a lot to the error gets a large gradient (lots of blame) and gets adjusted more. A weight that barely affected the output gets a tiny gradient and barely changes.
The Blame Chain in a Network
# Neural network as a blame chain:
#
# Input → [Layer 1] → [Layer 2] → [Layer 3] → Loss
#   x   →  w₁·x+b₁  →  w₂·h₁+b₂ →  w₃·h₂+b₃ →  L
#
# Backprop traces blame backward:
# Loss → How much did Layer 3 cause this?
#      → How much did Layer 2 cause this?
#      → How much did Layer 1 cause this?
#
# Each layer's "blame" = its gradient
# ∂L/∂w₃, ∂L/∂w₂, ∂L/∂w₁
Real World
Bad meal → blame chef → blame supplier → blame farmer
In AI
High loss → blame layer 3 → blame layer 2 → blame layer 1
The Chain Rule — Multiplying Blame
If A affects B, and B affects C, then A affects C through B
The Analogy
If each extra bag of fertilizer adds two bushels of crop yield, and each extra bushel adds three points of restaurant quality, then each extra bag of fertilizer adds 2 × 3 = 6 quality points. Rates of change multiply through the chain. That’s the chain rule: dC/dA = dC/dB × dB/dA.
Key insight: Backpropagation IS the chain rule. Nothing more, nothing less. It’s just the chain rule applied systematically to a computational graph. Every “deep learning breakthrough” in training ultimately relies on this one calculus rule from the 1600s.
Worked Example
# Chain rule: df/dx = df/du × du/dx
# Example: f(x) = (3x + 2)²
# Let u = 3x + 2, then f = u²
# df/du = 2u
# du/dx = 3
# df/dx = 2u × 3 = 6(3x + 2)
# At x = 1: u = 5
# df/dx = 6 × 5 = 30

# Multi-step chain:
# dL/dw₁ = dL/dh₃ × dh₃/dh₂ × dh₂/dh₁ × dh₁/dw₁
# Each × is one link in the blame chain
Formula: For f(g(h(x))): df/dx = df/dg × dg/dh × dh/dx. Each factor is one “link” in the chain. More layers = more multiplications.
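The worked example above is easy to sanity-check numerically. A plain-Python sketch (the function names and finite-difference step are ours):

```python
# f(x) = (3x + 2)², analytic derivative df/dx = 6(3x + 2)

def f(x):
    return (3 * x + 2) ** 2

def analytic_grad(x):
    return 6 * (3 * x + 2)  # chain rule: 2u × 3 with u = 3x + 2

# Central finite difference: (f(x+h) - f(x-h)) / 2h
h = 1e-6
x = 1.0
numeric = (f(x + h) - f(x - h)) / (2 * h)

print(analytic_grad(x))   # 30
print(round(numeric, 3))  # ≈ 30.0
```

Comparing an analytic gradient against a finite difference like this is the standard way to catch chain-rule mistakes in hand-derived backprop code.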
Computational Graphs
Drawing the recipe so we can trace blame backward
The Analogy
A computational graph is like a recipe flowchart. Each node is an operation (add, multiply, square). Edges show what feeds into what. To trace blame backward, you follow the arrows in reverse, multiplying local derivatives at each step. PyTorch builds this graph automatically during the forward pass.
Key insight: PyTorch’s requires_grad=True tells the system “record every operation on this tensor.” During the forward pass, PyTorch secretly builds a DAG (directed acyclic graph) of all operations. When you call .backward(), it walks this graph in reverse, applying the chain rule at every node.
Worked Example
# Computational graph for: L = (w×x + b)²
#
# w ──┐
#     ├── [×] ── z ──┐
# x ──┘              ├── [+] ── h ── [²] ── L
# b ─────────────────┘
#
# Forward: z = w×x, h = z+b, L = h²
# Backward (chain rule):
# dL/dh = 2h
# dL/db = dL/dh × dh/db = 2h × 1 = 2h
# dL/dz = dL/dh × dh/dz = 2h × 1 = 2h
# dL/dw = dL/dz × dz/dw = 2h × x
Real World
Recipe flowchart: ingredients → steps → dish
In AI
Computational graph: inputs → operations → loss
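The graph walk can be sketched with a toy node class (our own minimal code, not PyTorch internals; it assumes each node feeds into only one consumer, whereas real autograd handles arbitrary DAGs):

```python
# Each node stores its value, its parent nodes, and the local derivative
# d(self)/d(parent) for each parent — the "local blame" rule.

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes that fed into this one
        self.local_grads = local_grads  # d(self)/d(parent) per parent
        self.grad = 0.0                 # accumulated blame

def mul(a, b):
    return Node(a.value * b.value, (a, b), (b.value, a.value))

def add(a, b):
    return Node(a.value + b.value, (a, b), (1.0, 1.0))

def square(a):
    return Node(a.value ** 2, (a,), (2 * a.value,))

def backward(loss):
    loss.grad = 1.0  # dL/dL = 1
    stack = [loss]
    while stack:
        node = stack.pop()
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local  # chain rule at this edge
            stack.append(parent)

# L = (w×x + b)² with w=2, x=3, b=1
w, x, b = Node(2.0), Node(3.0), Node(1.0)
L = square(add(mul(w, x), b))
backward(L)
print(w.grad, b.grad)  # 42.0 14.0
```

Note how each operation records its local derivatives during the forward pass; the backward walk just multiplies and accumulates them, exactly the recipe-flowchart-in-reverse described above.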
Forward Pass — Computing the Output
Run the recipe from inputs to loss
The Analogy
The forward pass is cooking the meal: take ingredients (input), follow the recipe (network operations), produce the dish (output), and taste it (compute loss). During this process, PyTorch saves every intermediate result — like a security camera recording every step of the cooking process, so you can review the footage later to figure out what went wrong.
Key insight: The forward pass must save intermediate values (activations) because the backward pass needs them. This is why training uses ~2–3× more memory than inference — you’re storing the “security footage” for backpropagation. Gradient checkpointing trades compute for memory by recomputing instead of storing.
Worked Example with Numbers
# Forward pass: L = (w×x + b)²
# w=2, x=3, b=1
w, x, b = 2, 3, 1

# Step 1: z = w × x = 2 × 3 = 6
z = w * x   # 6 (saved!)

# Step 2: h = z + b = 6 + 1 = 7
h = z + b   # 7 (saved!)

# Step 3: L = h² = 7² = 49
L = h ** 2  # 49 (the loss)

# All intermediates saved for the backward pass
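The "saved!" intermediates are exactly what gradient checkpointing trades away. A toy contrast in plain Python (not PyTorch's actual torch.utils.checkpoint API, just the idea):

```python
w, x, b = 2.0, 3.0, 1.0

# Standard backprop: store activations during the forward pass
z = w * x
h = z + b
saved = {"z": z, "h": h}     # memory cost grows with network depth

# Checkpointed: store nothing, recompute when the backward pass needs it
def forward_to_h(w, x, b):
    return w * x + b         # extra compute, no stored activations

h_recomputed = forward_to_h(w, x, b)
dL_dh = 2 * h_recomputed     # same gradient either way
print(dL_dh)  # 14.0
```

Either way the gradient is identical; checkpointing simply pays a second forward computation instead of holding the "security footage" in memory.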
Backward Pass — Tracing Blame
Walk the graph in reverse, multiplying local gradients
The Analogy
Now you review the security footage in reverse. Start at the loss and ask: “How much did the last step contribute?” Then: “How much did the step before that contribute?” At each node, you compute the local derivative and multiply it with the incoming blame. By the time you reach the weights, each one knows exactly how much it’s to blame.
Key insight: The backward pass visits each node exactly once, computing one local derivative and one multiplication. For a network with N operations, backprop costs roughly 2× the forward pass — NOT N× more. This efficiency is why deep learning is practical at all.
Worked Example with Numbers
# Backward pass (continuing from forward)
# w=2, x=3, b=1, z=6, h=7, L=49

# Start: dL/dL = 1 (trivially)

# Step 3 backward: L = h²
dL_dh = 2 * h        # 2×7 = 14

# Step 2 backward: h = z + b
dL_dz = dL_dh * 1    # 14 × 1 = 14
dL_db = dL_dh * 1    # 14 × 1 = 14

# Step 1 backward: z = w × x
dL_dw = dL_dz * x    # 14 × 3 = 42
dL_dx = dL_dz * w    # 14 × 2 = 28

# Result: ∂L/∂w = 42, ∂L/∂b = 14
# w gets 3× more blame than b!
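These hand-computed gradients can be verified with finite differences (a plain-Python sketch; the helper name is ours):

```python
# Nudge each parameter slightly and watch how the loss responds.
def loss(w, x, b):
    return (w * x + b) ** 2

eps = 1e-6
w, x, b = 2.0, 3.0, 1.0
dL_dw = (loss(w + eps, x, b) - loss(w - eps, x, b)) / (2 * eps)
dL_db = (loss(w, x, b + eps) - loss(w, x, b - eps)) / (2 * eps)
print(round(dL_dw, 3), round(dL_db, 3))  # ≈ 42.0 14.0
```

The numerical estimates match ∂L/∂w = 42 and ∂L/∂b = 14 from the chain-rule derivation.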
Backprop Through a Real Layer
A complete worked example with a 2-layer network
The Setup
Let’s trace backprop through a tiny 2-layer network with ReLU activation. Input x = 2, target y = 1, MSE loss. This is the exact process happening billions of times during training.
# 2-layer network: x → w₁ → ReLU → w₂ → loss
x = 2.0; y = 1.0
w1 = 0.5; w2 = -0.3

# Forward:
z1 = w1 * x        # 0.5 × 2 = 1.0
a1 = max(0, z1)    # ReLU(1.0) = 1.0
z2 = w2 * a1       # -0.3 × 1.0 = -0.3
L = (z2 - y)**2    # (-0.3 - 1)² = 1.69
Backward Pass
# Backward:
dL_dz2 = 2*(z2 - y)      # 2 × (-1.3) = -2.6
dL_dw2 = dL_dz2 * a1     # -2.6 × 1.0 = -2.6
dL_da1 = dL_dz2 * w2     # -2.6 × -0.3 = 0.78
dL_dz1 = dL_da1 * 1      # 0.78 (ReLU grad = 1, since z1 > 0)
dL_dw1 = dL_dz1 * x      # 0.78 × 2 = 1.56

# Update (lr = 0.1):
w1 -= 0.1 * 1.56         # 0.5 → 0.344
w2 -= 0.1 * (-2.6)       # -0.3 → -0.04
# w2 moved toward positive (reducing error)
Notice: w2 got a negative gradient (−2.6), so it increased (moved toward positive), which makes the output closer to the target y = 1. The math automatically figures out which direction each weight should move.
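Re-running the forward pass with the updated weights confirms the step helped (a plain-Python sketch of the example above):

```python
x, y = 2.0, 1.0
w1, w2 = 0.5, -0.3

def forward(w1, w2):
    z1 = w1 * x
    a1 = max(0.0, z1)    # ReLU
    z2 = w2 * a1
    return (z2 - y) ** 2

loss_before = forward(w1, w2)       # 1.69

# One gradient step, using the gradients derived above
w1_new = w1 - 0.1 * 1.56            # 0.5  → 0.344
w2_new = w2 - 0.1 * (-2.6)          # -0.3 → -0.04
loss_after = forward(w1_new, w2_new)

print(round(loss_before, 4), "→", round(loss_after, 4))  # 1.69 → ≈ 1.0558
```

One step of blame-guided adjustment cut the loss by more than a third; training repeats this loop until the loss stops improving.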
Forward vs. Reverse Mode Autodiff
Why backprop is reverse mode — and why it’s efficient
The Analogy
Forward mode: For each input, trace its influence forward through every output. If you have 1 million inputs and 1 output, that’s 1 million forward passes. Reverse mode: For each output, trace blame backward through all inputs. With 1 output (loss) and 1 million inputs (weights), that’s just 1 backward pass. Neural networks have many inputs (weights) and one output (loss), so reverse mode wins massively.
Key insight: This is why backpropagation (reverse mode) was revolutionary. Computing all 175 billion gradients of GPT-3 costs roughly 2–3× one forward pass. Forward mode would cost 175 billion forward passes. Reverse mode made deep learning computationally feasible.
Complexity Comparison
# Forward mode autodiff:
#   Cost = O(n × forward_pass)
#   where n = number of input parameters
#   GPT-3: 175B × forward_pass 😱

# Reverse mode autodiff (backprop):
#   Cost = O(m × forward_pass)
#   where m = number of outputs (usually 1)
#   GPT-3: 1 × forward_pass × ~2–3 ✓

# Forward mode wins when:
#   few inputs, many outputs (rare in ML)
# Reverse mode wins when:
#   many inputs, few outputs (always in ML!)
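Forward mode is easy to sketch with dual numbers, which carry a value and a derivative through the computation together (a toy class of our own, not a library API):

```python
class Dual:
    """Carries (value, derivative) through the forward pass."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __mul__(self, other):           # product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    def __add__(self, other):           # sum rule
        return Dual(self.value + other.value, self.deriv + other.deriv)

# L = (w×x + b)²; seeding w with deriv=1 yields dL/dw — and ONLY dL/dw
w = Dual(2.0, 1.0)   # "watch" this input
x = Dual(3.0)
b = Dual(1.0)
h = w * x + b
L = h * h
print(L.deriv)  # 42.0
```

Getting dL/db too would require a second full pass with b seeded instead, which is exactly why forward mode scales badly with the number of parameters.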
Forward Mode
Ask each ingredient: “how do you affect the final dish?” (1M questions)
Reverse Mode
Ask the dish: “which ingredients caused this?” (1 question, all answers)
PyTorch Autograd — Automatic Backprop
You write the forward pass; PyTorch handles the backward pass
The Analogy
PyTorch’s autograd is like having a robotic accountant who watches you cook, records every step, and automatically calculates exactly how much each ingredient contributed to the final taste. You just cook (write the forward pass). The accountant handles all the blame-tracing (backward pass) automatically.
Why it matters for AI: Before autograd, researchers had to manually derive and implement gradients for every new architecture. PyTorch’s dynamic computational graph means you can use Python control flow (if/else, loops) and autograd still works. This is why PyTorch became the dominant research framework.
In Practice
import torch

# Same 2-layer example, but PyTorch does it
x = torch.tensor(2.0)
y = torch.tensor(1.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-0.3, requires_grad=True)

# Forward (PyTorch records the graph)
z1 = w1 * x
a1 = torch.relu(z1)
z2 = w2 * a1
L = (z2 - y) ** 2

# Backward (one call does everything!)
L.backward()
w1.grad  # tensor(1.56) — same as manual!
w2.grad  # tensor(-2.60) — same as manual!
Source: Paszke et al. (2017) “Automatic differentiation in PyTorch” introduced the dynamic computational graph approach. PyTorch builds a new graph each forward pass, enabling Python-native control flow.