Ch 10 — Regularization & Practical Training

Dropout, batch normalization, layer normalization, data augmentation, and training recipes
High Level: Overfit → Dropout → BatchNorm → Augment → Early Stop → Recipes
The Overfitting Problem
When your model memorizes instead of learning
Bias vs. Variance
Overfitting occurs when a model performs well on training data but poorly on unseen data — it has memorized the training examples rather than learning generalizable patterns. Deep networks with millions of parameters are especially prone to overfitting because they have enough capacity to memorize entire datasets. The bias-variance tradeoff: high bias (underfitting) means the model is too simple; high variance (overfitting) means it’s too complex. Regularization techniques reduce variance without significantly increasing bias.
Detecting Overfitting
// Signs of overfitting
Training loss:   ↓ decreasing
Validation loss: ↑ increasing  ← gap!

// Signs of underfitting
Training loss:   → still high
Validation loss: → still high

// Good fit
Training loss:   ↓ decreasing
Validation loss: ↓ decreasing (close to train)
Critical in AI: Modern deep learning has a surprising twist called double descent: as model size grows, test error first worsens near the point where the model can just barely fit the training data, then improves again as the model gets even larger. This helps explain why very large models such as GPT-4 generalize well despite their enormous parameter counts.
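The overfitting signatures above can be turned into a simple monitor. A minimal sketch in plain Python; the loss histories below are made-up numbers for illustration:

```python
def generalization_gap(train_losses, val_losses):
    """Gap between the latest validation and training loss."""
    return val_losses[-1] - train_losses[-1]

def is_overfitting(train_losses, val_losses, window=3):
    """Heuristic: training loss still falling while validation loss rises."""
    if len(val_losses) < window + 1:
        return False
    train_falling = train_losses[-1] < train_losses[-1 - window]
    val_rising = val_losses[-1] > val_losses[-1 - window]
    return train_falling and val_rising

# Illustrative histories: training keeps improving, validation turns around
train = [2.0, 1.2, 0.8, 0.5, 0.3, 0.2]
val   = [2.1, 1.4, 1.0, 1.1, 1.3, 1.5]
print(is_overfitting(train, val))  # True: the classic divergence pattern
```

The `window` parameter is a crude noise filter; real training loops usually combine this with the patience-based early stopping shown later in this chapter.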
Dropout
Srivastava et al. (2014) — randomly silencing neurons
How Dropout Works
During training, dropout randomly sets each neuron’s output to zero with probability p (typically 0.5 for FC layers, 0.1–0.3 for conv layers). This prevents neurons from co-adapting — no neuron can rely on any specific other neuron being present. It’s like training an ensemble of 2ⁿ different sub-networks (where n is the number of neurons). To keep expected activations consistent, the original formulation scales outputs by (1-p) at inference time; PyTorch’s nn.Dropout instead uses inverted dropout, scaling the surviving activations by 1/(1-p) during training, so no adjustment is needed at inference.
PyTorch Dropout
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.drop = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.drop(x)  # active in train mode only
        return self.fc2(x)

# model.train() → dropout active
# model.eval()  → dropout disabled
Key insight: Dropout is equivalent to training an exponentially large ensemble of models that share weights. At test time, using all neurons with scaled weights approximates averaging the predictions of all sub-networks.
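The inverted-dropout trick that nn.Dropout implements can be sketched in a few lines of NumPy (the array size and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: scale survivors by 1/(1-p) during training."""
    if not training or p == 0.0:
        return x                        # inference: identity, no rescaling
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1-p
    return x * mask / (1.0 - p)         # rescale the survivors

x = np.ones(100_000)
y = dropout(x, p=0.5)
print(y.mean())  # ≈ 1.0: the expected activation matches inference
```

Because the rescaling happens in training, the inference path is a plain identity, which is exactly why model.eval() can simply disable the layer.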
Batch Normalization
Ioffe & Szegedy (2015) — normalizing layer inputs
What BatchNorm Does
Batch Normalization normalizes each layer’s inputs to have zero mean and unit variance across the mini-batch, then applies learnable scale (γ) and shift (β) parameters. The original motivation was internal covariate shift — the distribution of layer inputs changing as earlier layers update — though the precise reason BatchNorm helps is still debated. In practice it allows higher learning rates, reduces sensitivity to initialization, and acts as a mild regularizer. It was one of the most impactful techniques for training deep CNNs and appears in virtually every modern convolutional architecture.
BatchNorm Equations
// Batch Normalization
μ_B  = (1/m) · Σ xᵢ               // batch mean
σ²_B = (1/m) · Σ (xᵢ - μ_B)²      // batch variance
x̂ᵢ  = (xᵢ - μ_B) / √(σ²_B + ε)   // normalize
yᵢ   = γ · x̂ᵢ + β                // scale & shift

// γ, β are learnable parameters
// At inference: use running mean/var
// (not batch statistics)
// Placement: Conv → BN → ReLU
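These equations translate directly into NumPy. A minimal training-mode forward pass (running statistics for inference are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over a (batch, features) array."""
    mu = x.mean(axis=0)                     # μ_B: per-feature batch mean
    var = x.var(axis=0)                     # σ²_B: per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize
    return gamma * x_hat + beta             # learnable scale & shift

# Illustrative input: 64 samples, 8 features, far from zero mean/unit variance
x = np.random.default_rng(1).normal(5.0, 3.0, size=(64, 8))
y = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6))  # ~0 per feature
print(y.std(axis=0).round(3))   # ~1 per feature
```

With γ=1 and β=0 the output is exactly the normalized activations; during training the network is free to learn other values, including ones that undo the normalization if that helps.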
Key insight: BatchNorm depends on batch statistics, which makes it problematic for small batches and sequence models. Layer Normalization (Ba et al., 2016) normalizes across features instead of across the batch, making it batch-size independent. LayerNorm is the standard for transformers.
Layer Norm, Group Norm & RMSNorm
Normalization variants for different architectures
Normalization Zoo
Layer Norm normalizes across all features for each sample independently — no batch dependency. Standard for transformers (GPT, BERT, LLaMA). Group Norm (Wu & He, 2018) divides channels into groups and normalizes within each group — works well for small batches in detection/segmentation. RMSNorm (Zhang & Sennrich, 2019) simplifies LayerNorm by removing the mean centering, using only root-mean-square normalization. LLaMA and many modern LLMs use RMSNorm for efficiency.
Comparison
// Normalization comparison
BatchNorm: across batch, per channel         → CNNs (ResNet, EfficientNet)
LayerNorm: across features, per sample       → Transformers (GPT, BERT)
GroupNorm: across channel groups, per sample → Detection (small batches)
RMSNorm:   like LayerNorm, no mean centering → Modern LLMs (LLaMA, Gemma)

// Pre-norm (before attention) vs post-norm:
// pre-norm is standard for modern transformers
Key insight: The choice of normalization depends on the architecture: BatchNorm for CNNs, LayerNorm for transformers, GroupNorm for small-batch settings. Getting this wrong can significantly hurt performance.
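The LayerNorm/RMSNorm relationship is easiest to see side by side. A NumPy sketch (shapes and eps values are illustrative):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-6):
    """Normalize across the feature axis, per sample."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rmsnorm(x, gain, eps=1e-6):
    """Like LayerNorm but without mean centering: divide by the RMS."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.random.default_rng(0).normal(size=(2, 16))   # (batch, features)
out = rmsnorm(x, gain=np.ones(16))
print(out.shape)  # (2, 16) — no batch statistics involved anywhere
```

Neither function ever reduces over the batch axis, which is exactly why both work with batch size 1 and with variable-length sequences; RMSNorm additionally drops the mean subtraction and the β shift, saving a small amount of compute per token.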
Data Augmentation
Creating more training data from existing data
Why Augmentation Works
Data augmentation applies random transformations to training images (flips, rotations, crops, color jitter) to artificially increase dataset size and diversity. A dataset of 10K images with augmentation can behave like 100K+ images. This is one of the most effective regularization techniques — it directly addresses the root cause of overfitting (insufficient data diversity). Modern augmentation strategies like RandAugment (Cubuk et al., 2020) and Mixup (Zhang et al., 2018) have become standard.
Common Augmentations
# PyTorch augmentation pipeline
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(
        brightness=0.4, contrast=0.4,
        saturation=0.4, hue=0.1
    ),
    transforms.RandAugment(),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
Rule of thumb: Always use augmentation for image tasks. For NLP, augmentation is harder but includes back-translation, synonym replacement, and random deletion. For LLM pretraining, the massive dataset size makes traditional augmentation unnecessary.
Early Stopping & Weight Decay
Knowing when to stop and penalizing complexity
Early Stopping
Early stopping monitors validation loss during training and stops when it starts increasing (while training loss continues decreasing). It is one of the simplest and most effective ways to prevent overfitting. Typically, you save the model checkpoint with the best validation loss and use patience (wait N epochs after the best score before stopping) to avoid stopping too early due to noise.
Weight Decay (L2 Regularization)
Weight decay adds a penalty proportional to the squared magnitude of weights: L_total = L_data + λ||w||². This discourages large weights, pushing the model toward simpler solutions. In AdamW, weight decay is decoupled from the gradient update (λ typically 0.01–0.1). Weight decay is used in virtually every deep learning training recipe.
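The "decoupled" part of AdamW is subtle: with plain SGD, adding λ·w to the gradient and shrinking the weights directly are equivalent, but with Adam's adaptive denominator they are not. A one-step NumPy sketch (hyperparameters and the toy weights are illustrative, not from the text):

```python
import numpy as np

def adam_step(w, grad, lr=1e-3, wd=0.1, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Single Adam update from zero-initialized moments (t = 1)."""
    if not decoupled:
        grad = grad + wd * w              # classic L2: fold penalty into grad
    m = (1 - beta1) * grad                # first moment (m0 = 0)
    v = (1 - beta2) * grad ** 2           # second moment (v0 = 0)
    m_hat = m / (1 - beta1)               # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * wd * w               # AdamW: decay outside the update
    return w

w = np.array([10.0, 0.01])                # one large weight, one tiny weight
g = np.array([1.0, 1.0])
print(adam_step(w, g, decoupled=False))   # coupled: decay is normalized away
print(adam_step(w, g, decoupled=True))    # decoupled: large weight decays more
```

In the coupled version Adam divides the λ·w term by √v̂, so large weights are barely penalized; decoupling restores a decay proportional to |w|, which is why AdamW is the default in modern recipes.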
Early Stopping in Practice
# Early stopping with patience
best_val_loss = float('inf')
patience = 10
counter = 0

for epoch in range(max_epochs):
    train(model)
    val_loss = validate(model)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(model)
        counter = 0
    else:
        counter += 1
        if counter >= patience:
            break  # stop training
Rule of thumb: For fine-tuning pretrained models, use patience=3-5. For training from scratch, use patience=10-20. Always save the best checkpoint, not the last one.
Mixup, CutMix & Label Smoothing
Advanced regularization techniques
Mixup (Zhang et al., 2018)
Mixup creates new training examples by linearly interpolating between pairs of images and their labels: x_new = λ·x₁ + (1-λ)·x₂, y_new = λ·y₁ + (1-λ)·y₂. This encourages the model to behave linearly between training examples, producing smoother decision boundaries. CutMix (Yun et al., 2019) cuts a rectangular patch from one image and pastes it onto another, mixing labels proportionally to the area. Both significantly improve generalization on ImageNet.
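The mixup construction is only a few lines. A minimal NumPy sketch of building one mixed batch (α=0.2 and the toy data are illustrative):

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Blend each sample with a randomly chosen partner from the batch."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)           # λ ~ Beta(α, α); α ≈ 0.2–0.4 common
    perm = rng.permutation(len(x))         # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

x = np.arange(4, dtype=float).reshape(4, 1)  # toy "images"
y = np.eye(4)                                # one-hot labels, 4 classes
x_mix, y_mix = mixup_batch(x, y)
print(y_mix.sum(axis=1))  # each mixed label still sums to 1
```

Because the labels are mixed with the same λ as the inputs, the loss is simply cross-entropy against the soft targets; no change to the model is needed.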
Label Smoothing
Label smoothing (Szegedy et al., 2016) replaces hard labels [0, 0, 1, 0] with soft labels [0.033, 0.033, 0.9, 0.033]. Instead of pushing the model to be 100% confident, it encourages calibrated uncertainty. This prevents the model from becoming overconfident and improves generalization. Label smoothing of 0.1 is standard for ImageNet training and transformer models.
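The soft-label construction above (true class keeps 1-ε, the rest split ε evenly) is a one-liner:

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """True class keeps 1-eps; eps is spread over the other K-1 classes."""
    k = y_onehot.shape[-1]
    return y_onehot * (1 - eps) + (1 - y_onehot) * (eps / (k - 1))

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))  # 0.9 for the true class, ε/3 ≈ 0.033 for the rest
```

PyTorch exposes the same idea via the label_smoothing argument of nn.CrossEntropyLoss (its variant spreads ε uniformly over all K classes, including the true one, so the exact values differ slightly).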
Key insight: These techniques share a common principle: they prevent the model from being too certain about any single training example. By introducing controlled ambiguity, they force the model to learn more robust, generalizable features.
Modern Training Recipes
Putting it all together
The Standard Recipe
Modern training recipes combine multiple techniques. For ImageNet CNNs: AdamW or SGD+momentum, cosine LR schedule, RandAugment, Mixup/CutMix, label smoothing 0.1, weight decay 0.05, and stochastic depth. For transformers/LLMs: AdamW (lr=3e-4, wd=0.1), linear warmup + cosine decay, gradient clipping, RMSNorm or LayerNorm, and dropout 0.1. These recipes have been refined through thousands of experiments and represent hard-won practical knowledge.
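The "linear warmup + cosine decay" schedule from the transformer recipe can be sketched as a pure function of the step count (warmup_steps and total_steps below are illustrative values, not from the text):

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=1000, total_steps=100_000,
          min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps        # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))     # decays 1 → 0
    return min_lr + (base_lr - min_lr) * cosine

print(lr_at(0))        # tiny LR at the very start of warmup
print(lr_at(1000))     # peak ≈ base_lr right after warmup ends
print(lr_at(100_000))  # ≈ min_lr at the end of training
```

Warmup protects the early steps, when Adam's moment estimates are still unreliable, and the cosine tail anneals the LR smoothly instead of in abrupt drops.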
The connection: Regularization makes deep networks generalize. With these tools — dropout, normalization, augmentation, early stopping — we can train models with billions of parameters that still generalize to unseen data. Next: the attention mechanism that replaced recurrence and enabled transformers.
Regularization Checklist
// Modern training checklist
✓ Weight decay (0.01–0.1)
✓ Dropout (0.1–0.5)
✓ Data augmentation (RandAugment)
✓ Normalization (BN/LN/RMSNorm)
✓ LR schedule (warmup + cosine)
✓ Early stopping (patience 5–20)
✓ Label smoothing (0.1)
✓ Mixup/CutMix (for vision)
✓ Gradient clipping (max_norm=1.0)

// Not all are needed for every task
// Start simple, add as needed