Ch 6 — Training Neural Networks

Backpropagation, gradient descent, optimizers, and the tricks that make deep learning work
High Level: Forward → Loss → Backprop → Optimize → Regularize → Scale
The Training Loop
Forward pass, compute loss, backward pass, update weights — repeat
How Neural Networks Learn
Training is an iterative process. Feed data forward through the network, measure how wrong the predictions are (loss), compute which direction to adjust each weight (gradients), then nudge the weights to reduce the error. Repeat millions of times.
# The training loop
for epoch in range(num_epochs):
    for batch in training_data:
        # 1. Forward pass
        predictions = model(batch.inputs)
        # 2. Compute loss
        loss = loss_fn(predictions, batch.targets)
        # 3. Backward pass (backpropagation)
        gradients = compute_gradients(loss)
        # 4. Update weights
        weights -= learning_rate × gradients
        # 5. Zero gradients (reset accumulated gradients to zero)
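The loop above can be made concrete with a minimal NumPy sketch. This fits a one-dimensional linear model to toy data; the dataset, learning rate, and batch size are illustrative choices, not recommendations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x + 1 plus a little noise
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=256)

w, b = 0.0, 0.0          # model parameters
learning_rate = 0.1
batch_size = 32

for epoch in range(50):
    idx = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        # 1. Forward pass
        pred = w * xb + b
        # 2. Compute loss (mean squared error)
        err = pred - yb
        loss = np.mean(err ** 2)
        # 3. Backward pass: gradients of MSE w.r.t. w and b
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        # 4. Update weights (step 5, zeroing, is implicit here:
        #    gradients are recomputed fresh each iteration)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # close to the true values 3.0 and 1.0
```

Frameworks like PyTorch automate steps 3 and 5 (`loss.backward()`, `optimizer.zero_grad()`), but the structure is exactly this.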
Key Terminology
Epoch: One complete pass through the entire training dataset
Batch: A subset of training examples processed together (e.g., 32 or 64 samples)
Iteration: One weight update (one batch processed)
Learning rate: How big a step to take when updating weights (e.g., 0.001)
# Example scale
Dataset: 50,000 images
Batch size: 64
Iterations per epoch: 50,000 / 64 ≈ 781
Epochs: 100
Total updates: 781 × 100 = 78,100
Why batches? Processing the full dataset at once is too slow and memory-intensive. Mini-batches give a noisy but useful gradient estimate, enable GPU parallelism, and actually help generalization — the noise acts as implicit regularization.
Backpropagation
The algorithm that made deep learning possible
The Core Idea
Backpropagation answers: “How much did each weight contribute to the error?” It uses the chain rule from calculus to efficiently compute the gradient of the loss with respect to every weight in the network, working backward from the output layer to the input layer.
The Chain Rule
If y = f(g(x)), then dy/dx = f’(g(x)) × g’(x). In a neural network, the loss depends on the output, which depends on the hidden layers, which depend on the weights. The chain rule lets us decompose this into local gradients that multiply together.
# Backprop in 3 steps:
1. Forward: compute output and loss
2. Backward: for each layer (last to first):
   • Compute ∂Loss/∂weights using the chain rule
   • Pass the gradient to the previous layer
3. Update: adjust all weights simultaneously

# Chain rule example:
∂Loss/∂w = ∂Loss/∂output × ∂output/∂hidden × ∂hidden/∂w
# Each factor is a "local gradient"
# computed at each layer independently
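The chain-rule decomposition above can be verified directly. This sketch backpropagates through a tiny network (one tanh hidden layer, squared-error loss; the shapes and seed are arbitrary) and checks one analytic gradient against a numerical finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny network: x -> hidden (tanh) -> scalar output, squared-error loss
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=4)
target = 1.5

def loss_fn(W1, w2):
    h = np.tanh(W1 @ x)          # forward through hidden layer
    out = w2 @ h                 # output layer
    return 0.5 * (out - target) ** 2

# Backward pass: multiply local gradients, last layer to first
h = np.tanh(W1 @ x)
out = w2 @ h
d_out = out - target                 # ∂Loss/∂output
d_h = d_out * w2                     # ∂Loss/∂hidden
d_pre = d_h * (1 - h ** 2)           # through tanh: tanh'(z) = 1 - tanh(z)²
grad_W1 = np.outer(d_pre, x)         # ∂Loss/∂W1
grad_w2 = d_out * h                  # ∂Loss/∂w2

# Numerical check on one entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (loss_fn(W1p, w2) - loss_fn(W1, w2)) / eps
print(abs(numeric - grad_W1[0, 0]) < 1e-4)  # True: analytic matches numeric
```

This finite-difference check is exactly how autodiff implementations are tested in practice.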
Historical note: Backpropagation was popularized by Rumelhart, Hinton, and Williams in their landmark 1986 Nature paper. The algorithm itself was discovered independently by multiple researchers (Werbos 1974, Linnainmaa 1970), but the 1986 paper demonstrated it could train multi-layer networks effectively — solving the XOR problem from Ch 5.
Gradient Descent Variants
SGD, mini-batch, and the landscape of optimization
The Optimization Landscape
Imagine the loss function as a mountainous landscape. Each point represents a set of weights; the height is the loss. Training is like descending to the lowest valley while blindfolded — you can only feel the slope beneath your feet (the gradient) and take steps downhill.
# Gradient descent variants

Batch GD
  Compute gradient on the entire dataset
  Stable but slow; needs all data in memory

Stochastic GD (SGD)
  Compute gradient on 1 sample
  Noisy but fast; good for online learning

Mini-batch GD  ← standard practice
  Compute gradient on a batch (32–512 samples)
  Best of both: stable enough, fast enough
  Enables GPU parallelism
Challenges
Local minima: Getting stuck in a valley that isn’t the deepest
Saddle points: Flat regions where gradients are near zero
Ravines: Narrow valleys where the gradient oscillates
Learning rate: Too high = overshoot; too low = painfully slow
Good news: In high-dimensional spaces (millions of parameters), local minima are rarely a problem. Most critical points are saddle points, and SGD’s noise helps escape them. Research shows that most local minima in deep networks have loss values close to the global minimum.
Modern Optimizers
Momentum, RMSProp, Adam — smarter ways to descend
# SGD with Momentum
velocity = β × velocity + gradient
weights -= η × velocity
# Like a ball rolling downhill —
# accumulates speed in a consistent direction

# RMSProp (Hinton, 2012)
cache = β × cache + (1−β) × gradient²
weights -= η × gradient / (√cache + ε)
# Adapts the learning rate per parameter:
# larger updates for rarely-updated features

# Adam (Kingma & Ba, 2015)
m = β₁ × m + (1−β₁) × gradient
v = β₂ × v + (1−β₂) × gradient²
weights -= η × m̂ / (√v̂ + ε)   # m̂, v̂ are bias-corrected m, v
# Combines momentum + adaptive rates
# Default choice for most tasks
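The three update rules can be compared on a toy problem. This sketch minimizes a badly conditioned quadratic (a stand-in for the "ravine" landscape); the hyperparameters are illustrative, not tuned recommendations:

```python
import numpy as np

# Minimize f(w) = ½ wᵀA w, with curvature 100x larger in one direction
A = np.diag([1.0, 100.0])

def grad(w):
    return A @ w

def sgd_momentum(w, steps=2000, lr=0.009, beta=0.9):
    velocity = np.zeros_like(w)
    for _ in range(steps):
        velocity = beta * velocity + grad(w)     # accumulate direction
        w = w - lr * velocity
    return w

def rmsprop(w, steps=2000, lr=0.01, beta=0.9, eps=1e-8):
    cache = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        cache = beta * cache + (1 - beta) * g ** 2
        w = w - lr * g / (np.sqrt(cache) + eps)  # per-parameter step size
    return w

def adam(w, steps=2000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g                # 1st moment (momentum)
        v = b2 * v + (1 - b2) * g ** 2           # 2nd moment (adaptive scale)
        m_hat = m / (1 - b1 ** t)                # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([5.0, 5.0])
for opt in (sgd_momentum, rmsprop, adam):
    print(opt.__name__, np.linalg.norm(opt(w0.copy())) < 0.5)  # all True
```

Note the bias correction in Adam: because m and v start at zero, their early values underestimate the true moments, and dividing by (1 − βᵗ) compensates.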
Which Optimizer to Use?
Adam: Default starting point. Works well out of the box with minimal tuning. Used for most deep learning tasks.

SGD + Momentum: Often achieves better final performance with careful tuning. Preferred for image classification (ResNets) and when you have time to tune.

AdamW: Adam with decoupled weight decay. Standard for transformer training (BERT, GPT).
Learning rate is the most important hyperparameter. Too high: training diverges (loss explodes). Too low: training is painfully slow. Learning rate schedulers reduce the rate over time — start fast for coarse adjustments, slow down for fine-tuning. Cosine annealing and warmup+decay are standard.
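The warmup-plus-cosine-decay schedule mentioned above has a simple closed form. This is one common shape (exact variants differ across codebases; the step counts and peak rate here are illustrative):

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr=3e-4, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp linearly from ~0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # Decay: cosine from peak_lr down to min_lr over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

lrs = [lr_schedule(s, total_steps=10_000, warmup_steps=500) for s in range(10_000)]
print(max(lrs) == 3e-4)              # peak reached right after warmup
print(lrs[0] < lrs[250] < lrs[499])  # warmup ramps up
print(lrs[-1] < 1e-6)                # decayed to ~0 by the end
```

Warmup matters most for adaptive optimizers like Adam: the second-moment estimate v is unreliable in the first few hundred steps, so starting at full learning rate can destabilize training.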
Regularization
Preventing overfitting — making models generalize
The Overfitting Problem
A model that memorizes training data but fails on new data is overfitting. The training loss keeps decreasing but validation loss starts increasing. Regularization techniques constrain the model to find simpler solutions that generalize better.
# Regularization techniques

Dropout
  Randomly zero out neurons during training
  Probability p = 0.1–0.5 (layer-dependent)
  Forces redundancy, prevents co-adaptation

Weight Decay (L2)
  Add λ·∑w² to the loss function
  Penalizes large weights, shrinks them toward zero

Early Stopping
  Monitor validation loss during training
  Stop when val loss starts increasing
  Use the weights from the best epoch

Data Augmentation
  Create variations of training data (Ch 4)
  More diverse data = better generalization
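Dropout in particular is short enough to write out. This is the standard "inverted dropout" formulation (scaling at training time rather than at inference); the probability and array sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p, training=True):
    """Inverted dropout: zero each unit with probability p and scale
    survivors by 1/(1-p), so the expected activation is unchanged.
    At inference time it is a no-op."""
    if not training or p == 0.0:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

acts = np.ones(100_000)
out = dropout(acts, p=0.3)
print(round(out.mean(), 1))        # ≈ 1.0: expectation is preserved
print(round((out == 0).mean(), 1)) # ≈ 0.3: fraction of units dropped
```

The 1/(1−p) scaling is why no correction is needed at test time, which keeps inference code simple.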
Overfitting
  Train loss: 0.01
  Val loss: 2.50
  Model memorized the training data; performs terribly on new data.

Good Fit
  Train loss: 0.15
  Val loss: 0.18
  Small gap between train and val; generalizes well to new data.
Batch Normalization (Ioffe & Szegedy, 2015) normalizes layer inputs to have zero mean and unit variance during training. It stabilizes training, allows higher learning rates, and acts as mild regularization. Used in almost every modern CNN and many other architectures.
Batch Norm & Residual Connections
The two innovations that enabled truly deep networks
Batch Normalization
Normalizes each layer’s inputs across the batch to have mean 0 and variance 1, then applies learnable scale (γ) and shift (β) parameters. This stabilizes the distribution of activations, enabling faster training and higher learning rates.
# Batch normalization
μ = mean(batch)
σ² = variance(batch)
x̂ = (x − μ) / √(σ² + ε)
y = γ × x̂ + β   # γ, β are learnable params
# Placed after a linear/conv layer,
# before the activation function
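The forward pass above is a few lines of NumPy. This sketch covers training-time normalization only (the running statistics used at inference, and the backward pass, are omitted; the batch shape is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, var 1
    return gamma * x_hat + beta            # learnable scale and shift

# Activations with a shifted, stretched distribution
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))

print(np.allclose(y.mean(axis=0), 0.0, atol=1e-6))  # True
print(np.allclose(y.std(axis=0), 1.0, atol=1e-2))   # True
```

With γ = 1 and β = 0 the output is exactly the normalized activations; during training the network learns γ and β, so it can undo the normalization if that helps.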
Residual Connections (Skip Connections)
He et al. (2015) introduced ResNet with a simple idea: instead of learning y = F(x), learn y = F(x) + x. The “+x” is a skip connection that lets gradients flow directly through the network. This solved the degradation problem — deeper networks were performing worse than shallow ones before ResNets.
# Residual block output = F(x) + x # skip connection # If F(x) learns nothing useful, # the block just passes x through. # Worst case: identity mapping. # This makes depth "free" — adding # layers can only help, never hurt.
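The identity-mapping worst case described above can be checked directly. This sketch uses a two-layer F with a ReLU in between (the layer sizes are arbitrary, and real residual blocks typically add normalization):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is two linear layers with a ReLU between."""
    return x + W2 @ relu(W1 @ x)

x = rng.normal(size=8)

# If F's weights are all zero, the block is exactly the identity —
# the "worst case" the text describes: added depth cannot hurt.
zeros = np.zeros((8, 8))
print(np.array_equal(residual_block(x, zeros, zeros), x))  # True
```

The same additive structure is what keeps gradients healthy: ∂y/∂x = ∂F/∂x + I, so the backward signal always has a direct path of 1 through the skip connection.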
ResNet enabled 152-layer networks that outperformed 20-layer networks. Before ResNets, networks deeper than ~20 layers actually got worse. Skip connections are now used everywhere — transformers, U-Nets, diffusion models. They’re arguably the most important architectural innovation in deep learning.
Training at Scale
GPUs, distributed training, and mixed precision
Why GPUs?
Neural network training is dominated by matrix multiplications. CPUs process operations sequentially. GPUs have thousands of cores that process matrix operations in parallel. A single GPU can be 10–100x faster than a CPU for deep learning. Modern training uses clusters of thousands of GPUs.
# Training scale comparison

MNIST classifier (50K params)
  Hardware: 1 GPU, minutes

ResNet-50 (25M params)
  Hardware: 1–8 GPUs, hours

BERT (340M params)
  Hardware: 16 TPUs, 4 days

GPT-3 (175B params)
  Hardware: ~1,000 GPUs, months — cost ~$4.6M

GPT-4 (~1.8T params)
  Hardware: ~25,000 GPUs, months — cost ~$100M+ (estimated)
Scaling Techniques
Data parallelism: Split batches across multiple GPUs, average gradients
Model parallelism: Split the model across GPUs (when it doesn’t fit on one)
Pipeline parallelism: Different layers on different GPUs, pipelined
Mixed precision: Use float16 for forward/backward, float32 for weight updates. 2x memory savings, 2–3x speed boost.
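Data parallelism, the first technique in the list, reduces to a simple identity: when shards are equally sized, the average of per-worker gradients equals the full-batch gradient. This single-process sketch simulates the all-reduce step (the batch size and worker count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear model with an MSE loss, and a batch of 128 examples
X = rng.normal(size=(128, 4))
y = rng.normal(size=128)
w = rng.normal(size=4)

def mse_grad(Xs, ys, w):
    """Gradient of mean squared error for a linear model on one shard."""
    err = Xs @ w - ys
    return 2.0 * Xs.T @ err / len(ys)

full_grad = mse_grad(X, y, w)

# Split the batch across 4 "workers" (GPUs), compute local gradients,
# then average them — the all-reduce step in a real multi-GPU setup.
num_workers = 4
shards = np.array_split(np.arange(128), num_workers)
local_grads = [mse_grad(X[s], y[s], w) for s in shards]
averaged = np.mean(local_grads, axis=0)

print(np.allclose(averaged, full_grad))  # True: same gradient, 4x the throughput
```

Every worker then applies the same averaged gradient, so all replicas stay in sync without ever sharing raw data.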
The bitter lesson (Rich Sutton, 2019): General methods that leverage computation scale better than clever hand-designed approaches. More compute + more data + simple algorithms consistently wins. This is why scaling laws and GPU clusters dominate modern AI research.
Putting It All Together
The modern training recipe
# Modern training recipe

1. Architecture
   Choose a model (MLP, CNN, Transformer)
   Add batch norm + residual connections
2. Initialization
   He init for ReLU, Xavier for sigmoid/tanh
3. Optimizer
   Adam or AdamW (lr = 1e-3 to 3e-4)
   Or SGD + momentum for vision tasks
4. Learning rate schedule
   Warmup + cosine decay
   Or reduce on plateau
5. Regularization
   Dropout (0.1–0.3), weight decay (1e-4)
   Data augmentation, early stopping
6. Monitor
   Track train/val loss curves
   Watch for the overfitting gap
Key Takeaways
1. Training = forward pass + loss + backprop + weight update, repeated

2. Backpropagation uses the chain rule to compute gradients efficiently

3. Adam is the default optimizer; learning rate is the most important hyperparameter

4. Dropout, weight decay, and early stopping prevent overfitting

5. Batch normalization and residual connections enabled deep networks

6. GPUs are essential; modern models require thousands of GPUs

7. Scale (compute + data) is the dominant factor in modern AI
Coming up: Ch 7 applies these training techniques to CNNs (images), Ch 8 to RNNs (sequences), and Ch 9 to transformers (attention). The training fundamentals from this chapter underpin everything that follows.