Practical Guidelines
1. Start with AdamW (lr=3e-4, wd=0.01) as your baseline.
2. Use linear warmup for the first 5–10% of training steps.
3. Use cosine annealing to decay to ~10% of the peak LR (see the sketch after this list).
4. If training is unstable, reduce the LR or lengthen the warmup.
5. For fine-tuning pretrained models, use a 10–100× smaller LR than in pretraining.
6. Monitor both training and validation loss: if validation loss rises while training loss keeps falling, the model is overfitting.
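A minimal PyTorch sketch of guidelines 1–3, assuming PyTorch is available; the model, `total_steps`, `warmup_steps`, and the 10% floor are illustrative choices, not prescribed values:

```python
import math

import torch
from torch import nn

model = nn.Linear(128, 10)              # stand-in model for illustration
total_steps = 10_000                    # assumed total training steps
warmup_steps = int(0.05 * total_steps)  # linear warmup over the first 5% of steps
peak_lr, floor_ratio = 3e-4, 0.1        # decay to ~10% of the peak LR

optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Multiplier applied to peak_lr at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup from 0 to peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return floor_ratio + (1.0 - floor_ratio) * cosine  # cosine anneal down to the floor

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, after each batch:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Calling `scheduler.step()` right after `optimizer.step()` keeps the learning rate in lockstep with the step count, and the `floor_ratio` term is what stops the cosine schedule from decaying all the way to zero.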
The connection: Optimizers determine how efficiently a network learns. With training mechanics covered (loss, backprop, optimizers), the next chapter shifts to architecture: Convolutional Neural Networks — the breakthrough that taught machines to see.
The Evolution
// The optimizer family tree
SGD (1951, Robbins & Monro)
├─ + Momentum (1964, Polyak)
│  └─ Nesterov (1983)
├─ AdaGrad (2011, Duchi)
│  └─ RMSProp (2012, Hinton)
│     └─ Adam (2015, Kingma & Ba)
│        ├─ AdamW (2019, Loshchilov & Hutter)
│        └─ RAdam (2020, Liu)
└─ LAMB, LARS  // for large-batch training