The Analogy
Regularization is like a skeptical advisor who says “don’t trust extreme conclusions from limited data.” L2 regularization says “weights should be small” (Gaussian prior). L1 says “most weights should be zero” (Laplace prior). Dropout says “don’t rely on any single neuron” (approximate Bayesian model averaging).
Key insight: Every regularization technique has a Bayesian interpretation. L2 = Gaussian prior. L1 = Laplace prior. Dropout ≈ approximate Bayesian inference over an ensemble of sub-networks. Early stopping = implicit regularization (stopping before convergence caps the effective complexity of the model; for linear least squares it acts much like L2).
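The L2 = Gaussian prior correspondence can be checked numerically: for a linear model, minimizing NLL + λ‖w‖² by gradient descent lands on the same weights as the ridge/MAP closed form. A minimal NumPy sketch (all data synthetic; `lam`, the step size, and the iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

# Sketch: for linear regression, MAP with a Gaussian prior on w is ridge
# regression. Synthetic data; lam plays the role of sigma_noise^2 / sigma_prior^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

lam = 0.5

# Closed form: w_MAP = (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Same minimizer by gradient descent on ||Xw - y||^2 / 2 + (lam / 2) ||w||^2
w = np.zeros(3)
for _ in range(5000):
    grad = X.T @ (X @ w - y) + lam * w
    w -= 1e-3 * grad

# Both routes agree up to optimizer tolerance.
```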
Regularization Zoo
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in model for the examples below
lam = 1e-4                # regularization strength

# L2 (weight decay) = Gaussian prior
# Loss = NLL + λ‖w‖² — shrinks all weights toward zero
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=lam)

# L1 (Lasso) = Laplace prior
# Loss = NLL + λ‖w‖₁ — pushes many weights to exactly zero
l1_penalty = lam * sum(p.abs().sum() for p in model.parameters())

# Dropout = approximate Bayesian ensemble
layer = nn.Dropout(p=0.1)
# Zeros each activation with probability 0.1 (survivors scaled by 1/0.9)
# ≈ averaging over 2^n sub-networks, one per dropout mask

# Data augmentation = expanding the prior
# "I believe rotated cats are still cats"
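Why L1 produces exact zeros while L2 only shrinks is clearest in one dimension: for min over w of (w − a)² + penalty, each penalty has a closed-form solution. A self-contained NumPy sketch (no PyTorch needed; the `dropout` helper mirrors what inverted dropout, as in `nn.Dropout`, does at train time):

```python
import numpy as np

# 1-D picture of the zoo: solve min_w (w - a)^2 + penalty in closed form.

def l2_shrink(a, lam):
    # penalty = lam * w^2  ->  w* = a / (1 + lam): shrinks, never exactly zero
    return a / (1.0 + lam)

def l1_shrink(a, lam):
    # penalty = lam * |w|  ->  soft-thresholding: exact zero when |a| <= lam/2
    return np.sign(a) * max(abs(a) - lam / 2.0, 0.0)

def dropout(x, p, rng):
    # Inverted dropout: zero each activation with probability p and rescale
    # the survivors by 1/(1-p) so the expected activation is unchanged.
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```

For a small coefficient like a = 0.2 with lam = 1.0, `l1_shrink` returns exactly 0.0, while `l2_shrink` returns 0.1: the Laplace prior's sharp peak at zero is what creates sparsity.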
Practical rule: more data = less regularization needed. As the dataset grows, the likelihood dominates the prior, so MLE and MAP converge; with small data, strong priors (heavy regularization) are what prevent overfitting.
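The "more data, less regularization" rule can be verified directly: with λ held fixed, the gap between the unregularized (MLE) and ridge (MAP) solutions shrinks as n grows, because the XᵀX term grows with n while λI stays put. A NumPy sketch (synthetic data; λ = 10 is chosen deliberately large to exaggerate the effect):

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
lam = 10.0  # deliberately strong prior

def mle_map_gap(n):
    """Distance between unregularized (MLE) and ridge (MAP) fits on n points."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + 0.1 * rng.normal(size=n)
    w_mle = np.linalg.solve(X.T @ X, X.T @ y)
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
    return np.linalg.norm(w_mle - w_map)

gap_small = mle_map_gap(20)     # the prior noticeably shrinks the fit
gap_large = mle_map_gap(20000)  # the likelihood swamps the prior
```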