The Degradation Problem
By 2015, researchers had run into a paradox: deeper networks should be more expressive, yet in practice very deep networks performed worse than shallower ones, even on the training data. That rules out overfitting; it was an optimization problem, which the authors called degradation. Kaiming He et al. at Microsoft Research addressed it with residual learning: instead of forcing a stack of layers to learn a target mapping H(x) directly, let it learn the residual F(x) = H(x) - x. The block's output becomes F(x) + x, where the "+ x" is a skip connection (identity shortcut) that carries the input around the layers unchanged.
Key insight: Skip connections give gradients a direct identity path around the weighted layers, so they are never forced through a long chain of multiplications. If a block isn't useful, the optimizer can drive its weights toward F(x) = 0, leaving just the identity mapping, so a deeper network can in principle do no worse than its shallower counterpart.
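The gradient argument can be made concrete with a toy calculation. For y = F(x) + x, the local derivative is F'(x) + 1: the "+ 1" from the identity path keeps the gradient alive even when F'(x) is tiny. The sketch below (pure Python, with a hypothetical per-layer local gradient of 0.1 chosen only for illustration) compares the gradient surviving 50 plain layers versus 50 residual layers:

```python
def grad_through(depth, local_grad, skip):
    """Backpropagated gradient magnitude through `depth` stacked layers.

    Each layer's branch contributes a local derivative `local_grad`.
    Without a skip, the chain rule multiplies these directly; with a
    skip, each factor becomes (local_grad + 1) from d(F(x) + x)/dx.
    """
    g = 1.0
    for _ in range(depth):
        g *= (local_grad + 1.0) if skip else local_grad
    return g

plain = grad_through(50, 0.1, skip=False)     # 0.1**50: vanishes
residual = grad_through(50, 0.1, skip=True)   # 1.1**50: stays large
print(plain, residual)
```

The plain chain shrinks to around 1e-50 while the residual chain remains on the order of 100, which is the sense in which the identity path "bypasses layers entirely" for gradient flow.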
Residual Block
// Residual block
Input x ──┬── Conv → BN → ReLU → Conv → BN ──┐
          │                                  ↓
          └────────── identity ─────────→ [ADD] → ReLU → Output
// output = F(x) + x
// F(x) = learned residual
// x = skip connection (identity)
// ResNet-152: 3.57% top-5 error
// Won ILSVRC 2015 (1st in 5 tracks)
// 8× deeper than VGG, lower complexity
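The block in the diagram can be sketched in a few lines. This is a deliberately simplified scalar stand-in, not the paper's implementation: each affine map w*x + b stands in for a Conv + BN stage, and the weights here are illustrative, not trained.

```python
def relu(z):
    return max(0.0, z)

def residual_block(x, w1=0.5, b1=0.1, w2=-0.3, b2=0.05):
    """Scalar sketch of the diagram: branch F(x), then add the identity."""
    h = relu(w1 * x + b1)   # Conv -> BN -> ReLU (first branch stage)
    f = w2 * h + b2         # Conv -> BN (second stage; no ReLU before the add)
    return relu(f + x)      # [ADD] the identity shortcut, then final ReLU

# With all branch parameters at zero, F(x) = 0 and the block reduces to
# the identity for non-negative inputs: "doing nothing" is trivial to learn.
print(residual_block(2.0, w1=0.0, b1=0.0, w2=0.0, b2=0.0))  # -> 2.0
```

Note where the ReLUs sit: the branch's second stage has no activation of its own, so F(x) can be negative and cancel against x; the final ReLU is applied only after the addition, exactly as in the diagram.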