LeNet-5 (LeCun, 1998)
7 layers, 60K params
Handwritten digits (MNIST)
First practical CNN
AlexNet (Krizhevsky, 2012)
8 layers, 60M params
ImageNet top-5 error: 15.3% (vs 26% prev)
ReLU, dropout, GPU training
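The two regularization/activation tricks named above can be sketched in a few lines. A minimal NumPy sketch (the shapes and dropout rate are illustrative, not AlexNet's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """max(0, x): cheap and non-saturating, so gradients flow for large inputs."""
    return np.maximum(x, 0.0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with prob p during training, then
    rescale survivors by 1/(1-p) so the expected activation matches inference."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

h = relu(rng.standard_normal(1000))   # toy activations
h_train = dropout(h, p=0.5)           # noisy during training
h_infer = dropout(h, training=False)  # identity at inference
```

Inverted dropout (rescaling at train time rather than test time) is the common modern formulation; the original paper scaled at test time instead.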
VGGNet (Simonyan, 2014)
16-19 layers, 138M params
Uniform 3×3 convolutions
Showed that network depth drives accuracy
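Why uniform 3×3 kernels? Two stacked 3×3 convolutions cover the same receptive field as one 5×5 but with fewer weights and an extra nonlinearity in between. A quick arithmetic check (the channel count is a hypothetical example):

```python
# Receptive field and parameter count of stacked 3x3 convs vs. one big kernel.

def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field of `layers` stacked stride-1 convs with the same kernel."""
    return 1 + layers * (kernel - 1)

def conv_params(kernel: int, channels: int) -> int:
    """Weights in one conv mapping `channels` -> `channels` (bias ignored)."""
    return kernel * kernel * channels * channels

C = 256                            # hypothetical channel width
two_3x3 = 2 * conv_params(3, C)    # two stacked 3x3 layers
one_5x5 = conv_params(5, C)        # a single 5x5 layer

assert stacked_receptive_field(3, 2) == 5   # same coverage as one 5x5
print(two_3x3, one_5x5)                     # 3x3 stack is cheaper
```

The same argument extends: three 3×3 layers match a 7×7 receptive field at roughly half the parameter cost.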
GoogLeNet (Szegedy, 2014)
22 layers, 6.8M params
Inception modules (multi-scale)
Efficient parameter usage
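A key reason GoogLeNet stays at 6.8M parameters is the 1×1 "bottleneck" convolution inside each Inception branch: shrink the channel count cheaply before the expensive 3×3/5×5 conv. A back-of-the-envelope sketch (the channel counts here are hypothetical, not the paper's exact module):

```python
# Parameter savings from a 1x1 bottleneck before a 5x5 convolution.

def conv_params(k: int, c_in: int, c_out: int) -> int:
    return k * k * c_in * c_out  # bias ignored

C_in, C_out = 192, 128   # hypothetical module widths
bottleneck = 64          # hypothetical reduced width

direct  = conv_params(5, C_in, C_out)                                  # 5x5 directly
reduced = conv_params(1, C_in, bottleneck) + conv_params(5, bottleneck, C_out)

print(direct, reduced)   # bottleneck path is ~3x cheaper here
```

The multi-scale part is then just concatenating the 1×1, 3×3, and 5×5 branch outputs along the channel axis.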
ResNet (He, 2015)
50-152 layers, 25-60M params
Skip connections (y = F(x) + x)
3.57% top-5 error (superhuman!)
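The skip connection y = F(x) + x can be sketched with plain linear layers standing in for the conv layers (a toy sketch, not ResNet's actual block): the key property is that when F's weights are near zero, the block defaults to (nearly) the identity, which makes very deep stacks trainable.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), where F is two linear layers with a ReLU between."""
    f = relu(x @ W1) @ W2
    return relu(f + x)          # skip connection: add the input back

d = 8
x = rng.standard_normal(d)
W_zero = np.zeros((d, d))       # F(x) = 0  =>  block reduces to relu(x)
y = residual_block(x, W_zero, W_zero)
assert np.allclose(y, relu(x))
```

With zero (or tiny) weights the block passes its input straight through, so adding more blocks can't easily hurt; gradients also flow directly through the `+ x` path.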
EfficientNet (Tan, 2019)
Compound scaling (depth+width+resolution)
5.3M params (B0), beats ResNet-50 accuracy
Neural architecture search (NAS)
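Compound scaling ties depth, width, and resolution to a single coefficient φ: depth ×α^φ, width ×β^φ, resolution ×γ^φ, with α·β²·γ² ≈ 2 so FLOPs roughly double per unit of φ. A sketch using the coefficients reported in the paper (α=1.2, β=1.1, γ=1.15):

```python
# Compound scaling: one knob (phi) scales depth, width, and resolution jointly.

alpha, beta, gamma = 1.2, 1.1, 1.15   # paper's searched base coefficients

def compound_scale(phi: int):
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    return depth, width, resolution

for phi in range(4):
    d, w, r = compound_scale(phi)
    flops_factor = d * w**2 * r**2    # conv FLOPs scale ~ d * w^2 * r^2
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"res x{r:.2f}, FLOPs x{flops_factor:.2f}")
```

This is why scaling only one axis (just depth, as pre-ResNet intuition suggested) is suboptimal: the search found the best accuracy-per-FLOP comes from scaling all three together.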
Vision Transformer (Dosovitskiy, 2020)
No convolutions at all!
Split image into patches, use attention
Matches/beats CNNs with enough data
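The patch-plus-attention pipeline above can be sketched in NumPy: reshape the image into non-overlapping patches (one token per patch), then let every token attend to every other. A minimal single-head sketch with toy sizes; real ViT adds learned linear patch embeddings, positional encodings, and multiple heads/layers:

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)   # one row ("token") per patch

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over the patch tokens."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # softmax over keys
    return w @ V                            # each patch mixes info from all

img = rng.standard_normal((32, 32, 3))      # toy "image"
tokens = patchify(img, 8)                   # (16, 192): a 4x4 patch grid
d = tokens.shape[1]
out = self_attention(tokens, *(rng.standard_normal((d, d)) for _ in range(3)))
```

Note the contrast with convolution: every patch sees every other patch from layer one (global receptive field), which is why ViT needs large datasets to learn the locality bias CNNs get for free.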
The trend: ImageNet top-5 error fell from 26% (2011) to 3.57% (2015, ResNet), surpassing estimated human performance (~5%). The key innovations: depth (VGG), multi-scale processing (Inception), skip connections (ResNet), efficient compound scaling (EfficientNet), and now attention (ViT).