The Training Pipeline
1. Define the architecture (layers, activations).
2. Initialize weights (Xavier or He initialization).
3. Choose a loss function (MSE, cross-entropy).
4. Choose an optimizer (SGD, Adam).
5. For each epoch: forward pass → compute loss → backward pass → update weights.
6. Monitor validation loss to detect overfitting.
7. Save the best model checkpoint.

This pipeline is universal — it works for image classifiers, language models, and everything in between.
The connection: Backpropagation gave us the ability to train multi-layer networks. But vanilla gradient descent has limitations — it can be slow, get stuck in local minima, and is sensitive to learning rate. The next chapter covers optimizers (Adam, RMSProp) that solve these problems.
Weight Initialization Matters
// Bad: all zeros — every neuron computes the same output and gradient,
// so symmetry is never broken and nothing is learned
w = 0
// Bad: large uniform random — activation variance grows with layer width,
// so deep networks saturate or explode
w = random(-1, 1)
// Xavier init (Glorot & Bengio, 2010) — for sigmoid/tanh
w ~ N(0, 2/(n_in + n_out))    // second argument is the variance
// He init (He et al., 2015) — for ReLU
w ~ N(0, 2/n_in)
// Both keep activation variance roughly constant across layers
Key insight: Xavier Glorot (2010) and Kaiming He (2015) showed that matching initialization variance to the activation function prevents signals from shrinking or exploding during the forward pass — a prerequisite for stable training.
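The variance-matching claim is easy to check numerically. The sketch below (layer width and sample count are arbitrary) pushes unit-variance inputs through one linear layer under each scheme: naive uniform(-1, 1) weights blow the output variance up by a factor proportional to the layer width, while Xavier keeps it near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 512
x = rng.normal(0, 1, (1000, n_in))  # unit-variance inputs

# Naive uniform(-1, 1): per-weight variance 1/3, so the pre-activation
# variance is roughly n_in / 3 — it explodes with width
w_bad = rng.uniform(-1, 1, (n_in, n_out))

# Xavier: Var[w] = 2 / (n_in + n_out) keeps pre-activation variance ~1
w_xavier = rng.normal(0, np.sqrt(2.0 / (n_in + n_out)), (n_in, n_out))

# He: Var[w] = 2 / n_in compensates for ReLU zeroing half the signal
w_he = rng.normal(0, np.sqrt(2.0 / n_in), (n_in, n_out))

var_bad = (x @ w_bad).var()        # on the order of n_in / 3
var_xavier = (x @ w_xavier).var()  # close to 1
var_he_relu = np.maximum(0, x @ w_he).var()
```

With `n_in = 512`, the naive scheme yields a variance around 170 after a single layer; stack a few such layers and the forward signal is numerically useless, which is exactly the failure Xavier and He initialization prevent.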