Ch 10 — Regularization & Practical Training

Dropout, batch normalization, layer normalization, data augmentation, and training recipes
High Level: Overfit → Dropout → BatchNorm → Augment → Early Stop → Recipes
The Overfitting Problem
When your model memorizes instead of learning
Bias vs. Variance
Overfitting occurs when a model performs well on training data but poorly on unseen data — it has memorized the training examples rather than learning generalizable patterns. Deep networks with millions of parameters are especially prone to overfitting because they have enough capacity to memorize entire datasets. The bias-variance tradeoff: high bias (underfitting) means the model is too simple; high variance (overfitting) means it’s too complex. Regularization techniques reduce variance without significantly increasing bias.
Detecting Overfitting
// Signs of overfitting
Training loss:   ↓ decreasing
Validation loss: ↑ increasing  ← gap!

// Signs of underfitting
Training loss:   → still high
Validation loss: → still high

// Good fit
Training loss:   ↓ decreasing
Validation loss: ↓ decreasing (close to train)
Critical in AI: Modern deep learning has a surprising twist called double descent: as model size grows, test error first worsens near the point where the model can just barely fit the training data, then improves again as the model gets even larger. This helps explain why very large models such as GPT-4 generalize well despite their enormous parameter counts.
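The overfitting signatures above can be turned into a simple monitor. A minimal sketch in plain Python; the loss histories below are made-up numbers for illustration:

```python
def generalization_gap(train_losses, val_losses):
    """Gap between the latest validation and training loss."""
    return val_losses[-1] - train_losses[-1]

def is_overfitting(train_losses, val_losses, window=3):
    """Heuristic: training loss still falling while validation loss rises."""
    if len(val_losses) < window + 1:
        return False
    train_falling = train_losses[-1] < train_losses[-1 - window]
    val_rising = val_losses[-1] > val_losses[-1 - window]
    return train_falling and val_rising

# Illustrative histories: training keeps improving, validation turns around
train = [2.0, 1.2, 0.8, 0.5, 0.3, 0.2]
val   = [2.1, 1.4, 1.0, 1.1, 1.3, 1.5]
print(is_overfitting(train, val))  # True: the classic divergence pattern
```

The `window` parameter is a crude noise filter; real training loops usually combine this with the patience-based early stopping shown later in this chapter.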
Dropout
Srivastava et al. (2014) — randomly silencing neurons
How Dropout Works
During training, dropout randomly sets each neuron’s output to zero with probability p (typically 0.5 for FC layers, 0.1–0.3 for conv layers). This prevents neurons from co-adapting — no neuron can rely on any specific other neuron being present. It’s like training an ensemble of 2ⁿ different sub-networks (where n is the number of neurons). To keep expected activations consistent, the original formulation scales outputs by (1-p) at inference time; PyTorch’s nn.Dropout instead uses inverted dropout, scaling the surviving activations by 1/(1-p) during training, so no adjustment is needed at inference.
PyTorch Dropout
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.drop = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.drop(x)  # active in train mode only
        return self.fc2(x)

# model.train() → dropout active
# model.eval()  → dropout disabled
Key insight: Dropout is equivalent to training an exponentially large ensemble of models that share weights. At test time, using all neurons with scaled weights approximates averaging the predictions of all sub-networks.
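The inverted-dropout trick that nn.Dropout implements can be sketched in a few lines of NumPy (the array size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: scale survivors by 1/(1-p) during training."""
    if not training or p == 0.0:
        return x                        # inference: identity, no rescaling
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1-p
    return x * mask / (1.0 - p)         # rescale the survivors

x = np.ones(100_000)
y = dropout(x, p=0.5)
print(y.mean())  # ≈ 1.0: the expected activation matches inference
```

Because the rescaling happens in training, the inference path is a plain identity, which is exactly why model.eval() can simply disable the layer.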
Batch Normalization
Ioffe & Szegedy (2015) — normalizing layer inputs
What BatchNorm Does
Batch Normalization normalizes each layer’s inputs to have zero mean and unit variance across the mini-batch, then applies learnable scale (γ) and shift (β) parameters. The original motivation was internal covariate shift — the distribution of layer inputs changing as earlier layers update — though the precise reason BatchNorm helps is still debated. In practice it allows higher learning rates, reduces sensitivity to initialization, and acts as a mild regularizer. It was one of the most impactful techniques for training deep CNNs and appears in virtually every modern convolutional architecture.
BatchNorm Equations
// Batch Normalization
μ_B  = (1/m) · Σ xᵢ               // batch mean
σ²_B = (1/m) · Σ (xᵢ - μ_B)²      // batch variance
x̂ᵢ  = (xᵢ - μ_B) / √(σ²_B + ε)   // normalize
yᵢ   = γ · x̂ᵢ + β                // scale & shift

// γ, β are learnable parameters
// At inference: use running mean/var
// (not batch statistics)
// Placement: Conv → BN → ReLU
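These equations translate directly into NumPy. A minimal training-mode forward pass (running statistics for inference are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a (batch, features) array."""
    mu = x.mean(axis=0)                     # μ_B: per-feature batch mean
    var = x.var(axis=0)                     # σ²_B: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # learnable scale & shift

# Illustrative input: 64 samples, 8 features, far from zero mean/unit variance
x = np.random.default_rng(1).normal(5.0, 3.0, size=(64, 8))
y = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6))  # ~0 per feature
print(y.std(axis=0).round(3))   # ~1 per feature
```

With γ=1 and β=0 the output is exactly the normalized activations; during training the network is free to learn other values, including ones that undo the normalization if that helps.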
Key insight: BatchNorm depends on batch statistics, which makes it problematic for small batches and sequence models. Layer Normalization (Ba et al., 2016) normalizes across features instead of across the batch, making it batch-size independent. LayerNorm is the standard for transformers.
Layer Norm, Group Norm & RMSNorm
Normalization variants for different architectures
Normalization Zoo
Layer Norm normalizes across all features for each sample independently — no batch dependency. Standard for transformers (GPT, BERT, LLaMA). Group Norm (Wu & He, 2018) divides channels into groups and normalizes within each group — works well for small batches in detection/segmentation. RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean centering, using only root-mean-square normalization. LLaMA and many modern LLMs use RMSNorm for efficiency.
Comparison
// Normalization comparison
BatchNorm: across batch, per channel         → CNNs (ResNet, EfficientNet)
LayerNorm: across features, per sample       → Transformers (GPT, BERT)
GroupNorm: across channel groups, per sample → Detection (small batches)
RMSNorm:   like LayerNorm, no mean centering → Modern LLMs (LLaMA, Gemma)

// Pre-norm (before attention) vs post-norm:
// pre-norm is standard for modern transformers
Key insight: The choice of normalization depends on the architecture: BatchNorm for CNNs, LayerNorm for transformers, GroupNorm for small-batch settings. Getting this wrong can significantly hurt performance.
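The LayerNorm/RMSNorm relationship is easiest to see side by side. A NumPy sketch (shapes and eps values are illustrative):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-6):
    """Normalize across the feature axis, per sample."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rmsnorm(x, gain, eps=1e-6):
    """Like LayerNorm but without mean centering: divide by the RMS."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.random.default_rng(0).normal(size=(2, 16))   # (batch, features)
out = rmsnorm(x, gain=np.ones(16))
print(out.shape)  # (2, 16) — no batch statistics involved anywhere
```

Neither function ever reduces over the batch axis, which is exactly why both work with batch size 1 and with variable-length sequences; RMSNorm additionally drops the mean subtraction and the β shift, saving a small amount of compute per token.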
Data Augmentation
Creating more training data from existing data
Why Augmentation Works
Data augmentation applies random transformations to training images (flips, rotations, crops, color jitter) to artificially increase dataset size and diversity. A dataset of 10K images with augmentation can behave like 100K+ images. This is one of the most effective regularization techniques — it directly addresses the root cause of overfitting (insufficient data diversity). Modern augmentation strategies like RandAugment (Cubuk et al., 2020) and Mixup (Zhang et al., 2018) have become standard.
Common Augmentations
# PyTorch augmentation pipeline
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(
        brightness=0.4, contrast=0.4,
        saturation=0.4, hue=0.1
    ),
    transforms.RandAugment(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
Rule of thumb: Always use augmentation for image tasks. For NLP, augmentation is harder but includes back-translation, synonym replacement, and random deletion. For LLM pretraining, the massive dataset size makes traditional augmentation unnecessary.
Early Stopping & Weight Decay
Knowing when to stop and penalizing complexity
Early Stopping
Early stopping monitors validation loss during training and stops when it starts increasing (while training loss continues decreasing). It is one of the simplest and most effective ways to prevent overfitting. Typically, you save the model checkpoint with the best validation loss and use patience (wait N epochs after the best score before stopping) to avoid stopping too early due to noise.
Weight Decay (L2 Regularization)
Weight decay adds a penalty proportional to the squared magnitude of weights: L_total = L_data + λ||w||². This discourages large weights, pushing the model toward simpler solutions. In AdamW, weight decay is decoupled from the gradient update (λ typically 0.01–0.1). Weight decay is used in virtually every deep learning training recipe.
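The "decoupled" part of AdamW is subtle: with plain SGD, adding λ·w to the gradient and shrinking the weights directly are equivalent, but with Adam's adaptive denominator they are not. A one-step NumPy sketch (hyperparameters and the toy weights are illustrative, not from the text):

```python
import numpy as np

def adam_step(w, grad, lr=1e-3, wd=0.1, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Single Adam update from zero-initialized moments (t = 1)."""
    if not decoupled:
        grad = grad + wd * w              # classic L2: fold penalty into grad
    m = (1 - beta1) * grad                # first moment (m0 = 0)
    v = (1 - beta2) * grad ** 2           # second moment (v0 = 0)
    m_hat = m / (1 - beta1)               # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w               # AdamW: decay outside the update
    return w

w = np.array([10.0, 0.01])                # one large weight, one tiny weight
g = np.array([1.0, 1.0])
print(adam_step(w, g, decoupled=False))   # coupled: decay is normalized away
print(adam_step(w, g, decoupled=True))    # decoupled: large weight decays more
```

In the coupled version Adam divides the λ·w term by √v̂, so large weights are barely penalized; decoupling restores a decay proportional to |w|, which is why AdamW is the default in modern recipes.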
Early Stopping in Practice
# Early stopping with patience
best_val_loss = float('inf')
patience = 10
counter = 0

for epoch in range(max_epochs):
    train(model)
    val_loss = validate(model)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model)
        counter = 0
    else:
        counter += 1
        if counter >= patience:
            break  # stop training
Rule of thumb: For fine-tuning pretrained models, use patience=3-5. For training from scratch, use patience=10-20. Always save the best checkpoint, not the last one.
Mixup, CutMix & Label Smoothing
Advanced regularization techniques
Mixup (Zhang et al., 2018)
Mixup creates new training examples by linearly interpolating between pairs of images and their labels: x_new = λ·x₁ + (1-λ)·x₂, y_new = λ·y₁ + (1-λ)·y₂. This encourages the model to behave linearly between training examples, producing smoother decision boundaries. CutMix (Yun et al., 2019) cuts a rectangular patch from one image and pastes it onto another, mixing labels proportionally to the area. Both significantly improve generalization on ImageNet.
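The mixup construction is only a few lines. A minimal NumPy sketch of building one mixed batch (α=0.2 and the toy data are illustrative):

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Blend each sample with a randomly chosen partner from the batch."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)           # λ ~ Beta(α, α); α ≈ 0.2–0.4 common
    perm = rng.permutation(len(x))         # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

x = np.arange(4, dtype=float).reshape(4, 1)  # toy "images"
y = np.eye(4)                                # one-hot labels, 4 classes
x_mix, y_mix = mixup_batch(x, y)
print(y_mix.sum(axis=1))  # each mixed label still sums to 1
```

Because the labels are mixed with the same λ as the inputs, the loss is simply cross-entropy against the soft targets; no change to the model is needed.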
Label Smoothing
Label smoothing (Szegedy et al., 2016) replaces hard labels [0, 0, 1, 0] with soft labels [0.033, 0.033, 0.9, 0.033]. Instead of pushing the model to be 100% confident, it encourages calibrated uncertainty. This prevents the model from becoming overconfident and improves generalization. Label smoothing of 0.1 is standard for ImageNet training and transformer models.
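The soft-label construction above (true class keeps 1-ε, the rest split ε evenly) is a one-liner:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """True class keeps 1-eps; eps is spread over the other K-1 classes."""
    k = y_onehot.shape[-1]
    return y_onehot * (1 - eps) + (1 - y_onehot) * (eps / (k - 1))

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))  # 0.9 for the true class, ε/3 ≈ 0.033 for the rest
```

PyTorch exposes the same idea via the label_smoothing argument of nn.CrossEntropyLoss (its variant spreads ε uniformly over all K classes, including the true one, so the exact values differ slightly).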
Key insight: These techniques share a common principle: they prevent the model from being too certain about any single training example. By introducing controlled ambiguity, they force the model to learn more robust, generalizable features.
Modern Training Recipes
Putting it all together
The Standard Recipe
Modern training recipes combine multiple techniques. For ImageNet CNNs: AdamW or SGD+momentum, cosine LR schedule, RandAugment, Mixup/CutMix, label smoothing 0.1, weight decay 0.05, and stochastic depth. For transformers/LLMs: AdamW (lr=3e-4, wd=0.1), linear warmup + cosine decay, gradient clipping, RMSNorm or LayerNorm, and dropout 0.1. These recipes have been refined through thousands of experiments and represent hard-won practical knowledge.
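The "linear warmup + cosine decay" schedule from the transformer recipe can be sketched as a pure function of the step count (warmup_steps and total_steps below are illustrative values, not from the text):

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=1000, total_steps=100_000,
          min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps        # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))     # decays 1 → 0
    return min_lr + (base_lr - min_lr) * cosine

print(lr_at(0))        # tiny LR at the very start of warmup
print(lr_at(1000))     # peak ≈ base_lr right after warmup ends
print(lr_at(100_000))  # ≈ min_lr at the end of training
```

Warmup protects the early steps, when Adam's moment estimates are still unreliable, and the cosine tail anneals the LR smoothly instead of in abrupt drops.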
The connection: Regularization makes deep networks generalize. With these tools — dropout, normalization, augmentation, early stopping — we can train models with billions of parameters that still generalize to unseen data. Next: the attention mechanism that replaced recurrence and enabled transformers.
Regularization Checklist
// Modern training checklist
✓ Weight decay (0.01–0.1)
✓ Dropout (0.1–0.5)
✓ Data augmentation (RandAugment)
✓ Normalization (BN/LN/RMSNorm)
✓ LR schedule (warmup + cosine)
✓ Early stopping (patience 5–20)
✓ Label smoothing (0.1)
✓ Mixup/CutMix (for vision)
✓ Gradient clipping (max_norm=1.0)

// Not all are needed for every task
// Start simple, add as needed