Batch Normalization
Normalizes each feature of a layer's inputs across the batch to zero mean and unit variance, then applies learnable scale (γ) and shift (β) parameters. This stabilizes the distribution of activations, which enables faster training and higher learning rates.
# Batch normalization
μ = mean(batch)       # per-feature mean
σ² = variance(batch)  # per-feature variance
x̂ = (x - μ) / √(σ² + ε)
y = γ × x̂ + β # learnable params
# Placed after linear/conv layer,
# before activation function
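The pseudocode above can be sketched as a NumPy forward pass. This is a minimal training-mode illustration (it omits the running statistics a real layer tracks for inference); the function name and shapes are chosen for the example:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch norm forward pass: normalize each feature over the batch axis."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to mean 0, variance 1
    return gamma * x_hat + beta            # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # batch of 64, 8 features
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0))  # each feature's mean is ≈ 0 after normalization
print(y.std(axis=0))   # each feature's std is ≈ 1
```

With γ = 1 and β = 0 the output is purely the normalized x̂; training then moves γ and β so the network can recover any scale and offset it needs.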
Residual Connections (Skip Connections)
He et al. (2015) introduced ResNet with a simple idea: instead of learning y = F(x), learn y = F(x) + x. The “+x” is a skip connection that lets gradients flow directly through the network. This solved the degradation problem: before ResNets, deeper plain networks had higher training error than shallower ones, not merely worse test accuracy.
# Residual block
output = F(x) + x # skip connection
# If F(x) learns nothing useful,
# the block just passes x through.
# Worst case ≈ identity mapping,
# so adding residual layers
# shouldn't make the network worse.
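A minimal NumPy sketch of the block above, using a small two-layer MLP as an assumed residual function F (the paper uses conv layers; the weights here are illustrative). Setting the second weight matrix to zero shows the identity worst case:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = F(x) + x, where F is a two-layer MLP (illustrative choice of F)."""
    f = relu(x @ W1) @ W2  # the residual function F(x)
    return f + x           # skip connection

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))
W1 = rng.normal(scale=0.1, size=(16, 16))
W2 = np.zeros((16, 16))    # F(x) == 0, so the block is exactly the identity
y = residual_block(x, W1, W2)
print(np.allclose(y, x))   # True: with F = 0 the block just passes x through
```

Because the skip path is weight-free, the gradient of y with respect to x always contains a direct identity term, which is what keeps gradients flowing through very deep stacks.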
ResNet enabled 152-layer networks that outperformed much shallower ones; ResNet-152 won the ILSVRC 2015 classification task. Skip connections are now used everywhere: transformers, U-Nets, diffusion models. They are arguably the most important architectural innovation in deep learning.