Ch 2 — Training Deep Networks

Loss functions, backpropagation, the chain rule, and computational graphs
High Level
Loss → Gradient → Chain Rule → Comp. Graph → Backprop → Update
The Learning Objective
Defining what “good” means with a loss function
What Is a Loss Function?
Training a neural network means finding weights that make predictions match reality. A loss function (or cost function) measures how wrong the network is — a single number that quantifies the gap between predicted output ŷ and true label y. The goal of training is to minimize this number. Different tasks use different loss functions, but they all serve the same purpose: turning “how bad is this?” into a differentiable scalar.
Common Loss Functions
// Regression: Mean Squared Error
L = (1/n) · Σ(yᵢ - ŷᵢ)²

// Binary Classification: Binary Cross-Entropy
L = -(1/n) · Σ[yᵢ·log(ŷᵢ) + (1-yᵢ)·log(1-ŷᵢ)]

// Multi-class: Cross-Entropy with Softmax
L = -Σ yᵢ · log(softmax(zᵢ))
Key insight: The loss function is the only thing the network “sees” during training. A poorly chosen loss function means the network optimizes for the wrong objective, no matter how good the architecture is.
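The two scalar losses above can be sketched in a few lines of NumPy; this is an illustrative implementation, not a framework's API (the `eps` clipping is an assumption added to avoid `log(0)`):

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: average of squared residuals
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1.0, 0.0, 1.0])       # true labels
y_hat = np.array([0.9, 0.2, 0.7])   # predictions
print(mse(y, y_hat))                # ≈ 0.0467
print(binary_cross_entropy(y, y_hat))
```

Both return a single differentiable scalar, which is exactly what gradient descent needs.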
Gradient Descent
Walking downhill on the loss landscape
The Core Idea
Imagine the loss function as a hilly landscape where the height represents error. Gradient descent is the algorithm that finds the lowest valley. At each step, it computes the gradient — the direction of steepest ascent — and moves in the opposite direction. The learning rate (η) controls step size: too large and you overshoot; too small and training takes forever.
Why it matters: Gradient descent is the universal training algorithm for neural networks. Every model from a 2-layer MLP to GPT-4 is trained by some variant of this simple idea: compute gradients, step downhill, repeat.
The Update Rule
// Gradient descent update
w = w - η · ∂L/∂w

// η = learning rate (e.g., 0.001)
// ∂L/∂w = gradient of loss w.r.t. weight

// Three flavors:
// Batch GD:      all data  → 1 update
// Stochastic GD: 1 sample  → 1 update
// Mini-batch GD: B samples → 1 update  ← standard
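The update rule can be seen converging on a toy problem. This minimal sketch minimizes L(w) = (w - 3)², whose gradient is 2(w - 3); the function and learning rate are chosen purely for illustration:

```python
# Minimize L(w) = (w - 3)^2 with plain gradient descent.
# Analytic gradient: dL/dw = 2 * (w - 3)
w = 0.0
eta = 0.1  # learning rate η
for step in range(100):
    grad = 2 * (w - 3)
    w -= eta * grad   # step downhill, opposite the gradient
print(w)  # ≈ 3.0, the minimum of L
```

Each step multiplies the distance to the minimum by (1 - 2η) = 0.8, so the error shrinks geometrically; a larger η would overshoot, a smaller one would crawl.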
The Chain Rule
The mathematical engine behind backpropagation
Composing Derivatives
A deep network is a composition of functions: f(g(h(x))). To compute how the loss changes when you tweak a weight deep inside the network, you need the chain rule from calculus. If y = f(g(x)), then dy/dx = (df/dg) · (dg/dx). For a network with L layers, the gradient of the loss with respect to a weight in layer 1 is a product of L partial derivatives — one for each layer the signal passes through.
Chain Rule in Action
// Network: x → h₁ → h₂ → ŷ → Loss
// How does Loss change w.r.t. w₁?

∂L/∂w₁ = ∂L/∂ŷ     // loss → output
       · ∂ŷ/∂h₂    // output → hidden₂
       · ∂h₂/∂h₁   // hidden₂ → hidden₁
       · ∂h₁/∂w₁   // hidden₁ → weight

// Each factor is a local derivative
// Multiply them all → global gradient
Key insight: The chain rule turns a global question (“how does this weight affect the final loss?”) into a series of local questions (“how does each layer affect the next?”). This locality is what makes backpropagation efficient.
Computational Graphs
The data structure that makes automatic differentiation possible
What Is a Computational Graph?
A computational graph is a directed acyclic graph (DAG) where each node represents an operation (add, multiply, ReLU) and edges represent data flow. The forward pass flows left to right, computing the output. The backward pass flows right to left, computing gradients using the chain rule at each node. This is exactly how PyTorch and TensorFlow work internally — every tensor operation builds a graph that is later traversed backward.
Example Graph
// y = ReLU(x · w + b), then Loss

x ──┐
    ├─→ [multiply] ─→ [add] ─→ [ReLU] ─→ [Loss]
w ──┘                   ↑
                        b

// Forward:  compute values    left → right
// Backward: compute gradients right → left
// Each node stores its local gradient
Key insight: When you call loss.backward() in PyTorch, it traverses this graph in reverse, applying the chain rule at each node. This is called reverse-mode automatic differentiation — it computes all gradients in a single backward pass.
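The backward traversal can be written out by hand for the example graph above. This sketch assumes a squared loss (an illustrative choice, not part of the graph) and mirrors what autograd does: cache values on the forward pass, then multiply local gradients right to left:

```python
# Forward/backward by hand for: loss = ReLU(x*w + b)**2
x, w, b = 2.0, 0.5, 1.0

# Forward pass: compute and cache each node's value
m = x * w            # [multiply] node
z = m + b            # [add] node
y = max(z, 0.0)      # [ReLU] node
loss = y ** 2        # illustrative scalar loss

# Backward pass: apply the chain rule node by node, right → left
dloss_dy = 2 * y                    # local grad of the loss node
dy_dz = 1.0 if z > 0 else 0.0       # local grad of ReLU
dz_dm = 1.0                         # local grad of add (w.r.t. m)
dz_db = 1.0                         # local grad of add (w.r.t. b)
dm_dw = x                           # local grad of multiply (w.r.t. w)

grad_w = dloss_dy * dy_dz * dz_dm * dm_dw
grad_b = dloss_dy * dy_dz * dz_db
print(grad_w, grad_b)  # 8.0 4.0
```

`loss.backward()` performs exactly this multiplication of stored local gradients, just automatically and for arbitrarily large graphs.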
Backpropagation
The 1986 breakthrough that made deep learning possible
Rumelhart, Hinton & Williams (1986)
In October 1986, Rumelhart, Hinton, and Williams published “Learning representations by back-propagating errors” in Nature. They showed that by propagating error signals backward through the network — applying the chain rule layer by layer — you could compute gradients for every weight in every layer efficiently. This solved the problem Minsky identified in 1969: how to train multi-layer networks. The key insight was that hidden units could learn useful internal representations of the data.
The Algorithm
// Backpropagation in 4 steps

1. Forward pass
   Compute output ŷ from input x

2. Compute loss
   L = loss_fn(ŷ, y)

3. Backward pass
   For each layer (output → input):
   ∂L/∂wₗ = ∂L/∂aₗ · ∂aₗ/∂zₗ · ∂zₗ/∂wₗ
   // aₗ = activation, zₗ = pre-activation

4. Update weights
   wₗ = wₗ - η · ∂L/∂wₗ
Why it matters: Backpropagation’s computational cost is roughly 2× the forward pass, regardless of network depth. This O(n) efficiency is what makes training billion-parameter models feasible.
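The 4-step algorithm fits in a short NumPy sketch. This is a hypothetical minimal example (one hidden layer, sigmoid activations, MSE loss, a single training sample), not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))          # input
y = np.array([[1.0]])                # target
W1 = rng.normal(size=(4, 3)) * 0.5   # layer 1 weights
W2 = rng.normal(size=(1, 4)) * 0.5   # layer 2 weights
eta = 0.5

for _ in range(200):
    # 1. Forward pass
    z1 = W1 @ x;  a1 = sigmoid(z1)
    z2 = W2 @ a1; y_hat = sigmoid(z2)
    # 2. Compute loss
    loss = 0.5 * np.sum((y_hat - y) ** 2)
    # 3. Backward pass (chain rule, output → input)
    d2 = (y_hat - y) * y_hat * (1 - y_hat)   # ∂L/∂z2
    dW2 = d2 @ a1.T
    d1 = (W2.T @ d2) * a1 * (1 - a1)         # ∂L/∂z1
    dW1 = d1 @ x.T
    # 4. Update weights
    W2 -= eta * dW2
    W1 -= eta * dW1

print(loss)  # far below its initial value after training
```

Note the backward pass reuses the activations cached during the forward pass, which is where the "roughly 2× the forward pass" cost comes from.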
The Training Loop
Epochs, batches, and the rhythm of learning
Anatomy of Training
Training repeats a cycle: split data into mini-batches (typically 32–512 samples), run forward pass on a batch, compute loss, backpropagate gradients, update weights. One pass through the entire dataset is an epoch. Training typically runs for tens to hundreds of epochs. A validation set (held-out data) monitors whether the model is actually learning generalizable patterns or just memorizing training data (overfitting).
PyTorch Training Loop
for epoch in range(num_epochs):
    for X_batch, y_batch in dataloader:
        # Forward pass
        predictions = model(X_batch)
        loss = loss_fn(predictions, y_batch)

        # Backward pass
        optimizer.zero_grad()   # clear old grads
        loss.backward()         # compute grads
        optimizer.step()        # update weights
Rule of thumb: zero_grad() before backward() is essential — PyTorch accumulates gradients by default. Forgetting this is one of the most common beginner bugs.
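The same epoch/mini-batch rhythm can be run end to end without a framework. This sketch fits a one-parameter linear model y = w·x with mini-batch gradient descent; the data, batch size, and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0]                  # ground truth: w = 3.0
w = 0.0
eta, batch_size = 0.1, 32

for epoch in range(10):            # one epoch = one full pass over X
    idx = rng.permutation(len(X))  # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        y_hat = xb * w                          # forward pass
        grad = np.mean(2 * (y_hat - yb) * xb)   # dMSE/dw on this batch
        w -= eta * grad                         # update

print(w)  # ≈ 3.0
```

The shuffle-split-update structure is identical to the PyTorch loop; `dataloader`, `loss.backward()`, and `optimizer.step()` just automate the three commented lines.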
Vanishing & Exploding Gradients
Why deep networks were hard to train for decades
The Problem
The chain rule multiplies gradients across layers. If each layer’s gradient is slightly less than 1 (as with sigmoid), the product shrinks exponentially — gradients in early layers become vanishingly small, and those weights barely update. This is the vanishing gradient problem. Conversely, if gradients are slightly greater than 1, they explode exponentially, causing weights to diverge. Both problems get worse as networks get deeper.
Critical in AI: The vanishing gradient problem is why deep networks were considered impractical before ~2010. It took ReLU activations, better initialization (Xavier/He), batch normalization, and skip connections to finally solve it.
The Math
// 10-layer network with sigmoid
// sigmoid derivative max = 0.25

∂L/∂w₁ = 0.25 × 0.25 × ... × 0.25   (10 times)
       = 0.25¹⁰ ≈ 0.00000095        // ~10⁻⁶

// With ReLU (derivative = 1 for z > 0)
∂L/∂w₁ = 1 × 1 × ... × 1 = 1
// gradient flows unchanged!
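The exponential shrinkage is easy to tabulate. This sketch multiplies the best-case per-layer gradient factor (0.25 for sigmoid, 1 for an active ReLU) across several depths:

```python
# Best-case gradient magnitude after d layers:
# sigmoid's derivative peaks at 0.25; ReLU's is 1 for z > 0.
for d in [5, 10, 20]:
    sigmoid_grad = 0.25 ** d
    relu_grad = 1.0 ** d
    print(f"depth {d}: sigmoid {sigmoid_grad:.2e}, relu {relu_grad:.1f}")
```

At depth 20 the sigmoid chain is below 10⁻¹², smaller than float32 can meaningfully update a weight by; the ReLU chain stays at 1 regardless of depth.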
Putting It All Together
The complete training picture
The Training Pipeline
1. Define architecture (layers, activations).
2. Initialize weights (Xavier or He initialization).
3. Choose loss function (MSE, cross-entropy).
4. Choose optimizer (SGD, Adam).
5. For each epoch: forward pass → compute loss → backward pass → update weights.
6. Monitor validation loss to detect overfitting.
7. Save the best model checkpoint.

This pipeline is universal — it works for image classifiers, language models, and everything in between.
The connection: Backpropagation gave us the ability to train multi-layer networks. But vanilla gradient descent has limitations — it can be slow, get stuck in local minima, and is sensitive to learning rate. The next chapter covers optimizers (Adam, RMSProp) that solve these problems.
Weight Initialization Matters
// Bad: all zeros (symmetry → no learning)
w = 0

// Bad: large random (exploding activations)
w = random(-1, 1)

// Xavier init (Glorot, 2010) — for sigmoid/tanh
w ~ N(0, 2/(n_in + n_out))

// He init (2015) — for ReLU
w ~ N(0, 2/n_in)

// Keeps variance stable across layers
Key insight: Xavier Glorot (2010) and Kaiming He (2015) showed that matching initialization variance to the activation function prevents signals from shrinking or exploding during the forward pass — a prerequisite for stable training.
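The variance-preservation claim can be checked empirically. This sketch pushes a random vector through a stack of ReLU layers with He-initialized weights (layer width and depth are arbitrary illustration choices) and watches the activation scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                 # layer width (illustrative)
x = rng.normal(size=(n,))

for layer in range(20):
    # He init: weight variance 2/n_in, matched to ReLU
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
    x = np.maximum(W @ x, 0.0)   # linear layer + ReLU

print(np.std(x))  # stays O(1) across 20 layers
```

Replacing `np.sqrt(2.0 / n)` with a fixed large or tiny scale makes `np.std(x)` blow up or collapse toward zero within a few layers, which is exactly the forward-pass counterpart of exploding and vanishing gradients.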