Ch 5 — CNN Architectures

AlexNet, VGG, GoogLeNet, ResNet — the models that defined modern computer vision
AlexNet — The Big Bang (2012)
The model that launched the deep learning revolution
The ImageNet Moment
In September 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge. It achieved a top-5 error of 15.3%, crushing the runner-up (26.2%) by over 10 percentage points. This wasn’t incremental — it was a paradigm shift. The computer vision community, which had relied on hand-crafted features (SIFT, HOG), pivoted overnight to deep learning.
Architecture
// AlexNet (Krizhevsky et al., 2012)
Input:   227×227×3
Conv1:   11×11, 96 filters, stride 4
Pool1:   3×3, stride 2
Conv2:   5×5, 256 filters
Pool2:   3×3, stride 2
Conv3-5: 3×3 convolutions
FC6-7:   4096 neurons each
Output:  1000 classes (softmax)

// Key innovations:
✓ ReLU (not sigmoid/tanh)
✓ GPU training (2× NVIDIA GTX 580)
✓ Dropout (0.5 in FC layers)
✓ Data augmentation

// 60M parameters, trained ~1 week on 2 GPUs
VGGNet — Depth with Simplicity (2014)
Proving that deeper is better with uniform 3×3 convolutions
The VGG Philosophy
Karen Simonyan and Andrew Zisserman at Oxford asked a simple question: what if we just make the network deeper using only 3×3 convolutions? VGG-16 (16 weight layers) and VGG-19 (19 layers) proved that depth matters. Two stacked 3×3 convolutions have the same receptive field as one 5×5 but with fewer parameters and more non-linearities. VGG achieved 7.3% top-5 error on ImageNet.
Key insight: VGG’s uniform architecture (all 3×3 convs, double channels after each pool) made it easy to understand and implement. It became the go-to feature extractor for transfer learning and remains widely used as a backbone today.
VGG-16 Architecture
// VGG-16 (Simonyan & Zisserman, 2014)
Block 1: 2× Conv(3×3, 64)  + Pool → 112×112
Block 2: 2× Conv(3×3, 128) + Pool → 56×56
Block 3: 3× Conv(3×3, 256) + Pool → 28×28
Block 4: 3× Conv(3×3, 512) + Pool → 14×14
Block 5: 3× Conv(3×3, 512) + Pool → 7×7
FC: 4096 → 4096 → 1000

// 138M parameters (most in FC layers)
// Two 3×3 = same receptive field as 5×5
// but 2×(3²) = 18 vs 5² = 25 params
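The per-channel-pair arithmetic above scales up once real channel counts are included. A quick sanity check in Python (`conv_params` is a throwaway helper defined here, not library code):

```python
# Weight count of a k×k convolution mapping c_in → c_out channels
# (bias terms ignored). Used to verify VGG's stacking argument.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

C = 256  # an illustrative channel count
two_3x3 = 2 * conv_params(3, C, C)  # two stacked 3×3 convs
one_5x5 = conv_params(5, C, C)      # one 5×5 conv, same receptive field

print(two_3x3)  # 1179648
print(one_5x5)  # 1638400
```

The stacked pair uses 18/25 ≈ 72% of the parameters of the single 5×5, and inserts an extra ReLU between the two convolutions for free.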
GoogLeNet / Inception (2014)
Going wider, not just deeper
The Inception Module
Google’s GoogLeNet (Szegedy et al., 2015) introduced the Inception module: instead of choosing one filter size, use multiple sizes in parallel (1×1, 3×3, 5×5) and concatenate their outputs. 1×1 convolutions act as “bottleneck” layers that reduce channel dimensions before expensive 3×3 and 5×5 operations. GoogLeNet achieved 6.7% top-5 error with only 5 million parameters — 27× fewer than VGG.
Inception Module
// Inception module (simplified)
Input ──┬── 1×1 conv ──────────┐
        ├── 1×1 → 3×3 conv ────┤
        ├── 1×1 → 5×5 conv ────┤ → Concat
        └── 3×3 pool → 1×1 ────┘

// 1×1 convolutions = "bottleneck"
// Reduce channels before expensive ops
// 256ch → 1×1(64) → 3×3(64) = cheap
// vs 256ch → 3×3(64) = expensive
Key insight: The 1×1 convolution is one of the most important ideas in CNN design. It mixes channels without changing spatial dimensions, acting as a learnable dimensionality reduction. It’s used in nearly every modern architecture.
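The "cheap vs expensive" claim in the panel above can be checked with the same kind of weight counting (`conv_params` is an illustrative helper, not library code):

```python
# Weight count of a k×k convolution mapping c_in → c_out channels (no bias).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Direct: 3×3 conv straight from 256 input channels to 64 outputs
direct = conv_params(3, 256, 64)                              # 147456

# Bottleneck: 1×1 reduces 256 → 64 channels, then 3×3 on the reduced maps
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64)  # 53248

print(direct, bottleneck)  # 147456 53248
```

The bottleneck path does the same 3×3 spatial processing with roughly 2.8× fewer weights, which is how GoogLeNet fits multiple parallel filter sizes into a 5M-parameter budget.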
ResNet — Skip Connections (2015)
The breakthrough that made 152-layer networks possible
The Degradation Problem
By 2015, researchers found a paradox: making networks deeper should improve accuracy, but in practice, very deep networks performed worse than shallower ones — even on training data. This wasn’t overfitting; it was an optimization problem. Kaiming He et al. at Microsoft Research solved it with residual learning: instead of learning a function H(x), learn the residual F(x) = H(x) - x. The output becomes F(x) + x, where the “+ x” is a skip connection (identity shortcut).
Key insight: Skip connections let gradients flow directly through the identity path, bypassing layers entirely. If a layer isn’t useful, the network can learn F(x) = 0, effectively skipping it. This makes deeper networks at least as good as shallower ones.
Residual Block
// Residual block
Input x ──┬── Conv → BN → ReLU ──┐
          │   Conv → BN          │
          │        ↓             │
          └── identity ──→ [ADD] → ReLU → Output

// output = F(x) + x
// F(x) = learned residual
// x = skip connection (identity)

// ResNet-152: 3.57% top-5 error
// Won ILSVRC 2015 (1st in 5 tracks)
// 8× deeper than VGG, lower complexity
ResNet Variants & Bottleneck Blocks
Scaling residual networks efficiently
The Bottleneck Design
For deeper ResNets (50, 101, 152 layers), He et al. used bottleneck blocks: a 1×1 conv reduces channels, a 3×3 conv processes the reduced representation, and another 1×1 conv restores channels. This 1×1 → 3×3 → 1×1 pattern is far more parameter-efficient than two 3×3 convolutions. ResNet-50 has 25.6M parameters — less than VGG-16’s 138M despite being 3× deeper.
ResNet Family
// ResNet variants
ResNet-18:  11.7M params, 69.8% top-1
ResNet-34:  21.8M params, 73.3% top-1
ResNet-50:  25.6M params, 76.1% top-1 ← sweet spot
ResNet-101: 44.5M params, 77.4% top-1
ResNet-152: 60.2M params, 78.3% top-1

// Bottleneck block (ResNet-50+)
1×1 conv (256→64)  // reduce
3×3 conv (64→64)   // process
1×1 conv (64→256)  // restore
+ skip connection
Rule of thumb: ResNet-50 is the most commonly used variant — it offers the best trade-off between accuracy and compute. It’s the default backbone for object detection (Faster R-CNN), segmentation (Mask R-CNN), and many transfer learning tasks.
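The bottleneck's efficiency is again just weight counting (`conv_params` is a throwaway helper for this sketch):

```python
# Weight count of a k×k convolution mapping c_in → c_out channels (no bias).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Bottleneck block on 256-d features: 1×1 reduce → 3×3 → 1×1 restore
bottleneck = (conv_params(1, 256, 64)    # reduce
              + conv_params(3, 64, 64)   # process
              + conv_params(1, 64, 256)) # restore
print(bottleneck)  # 69632

# Naive alternative: two 3×3 convs operating directly on 256 channels
naive = 2 * conv_params(3, 256, 256)
print(naive)  # 1179648  (~17× more weights)
```

Roughly 17× fewer weights per block is what lets ResNet-50/101/152 go very deep while staying under VGG-16's parameter count.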
Post-ResNet Architectures
DenseNet, EfficientNet, and the search for efficiency
DenseNet (Huang et al., 2017)
DenseNet takes skip connections to the extreme: within each dense block, every layer connects to every subsequent layer. Layer N receives the feature maps of all preceding layers via concatenation. This encourages feature reuse, reduces parameters, and strengthens gradient flow. DenseNet-121 achieves accuracy comparable to ResNet-50 with far fewer parameters (roughly 8M vs 25.6M).
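Because every layer concatenates its inputs, channel counts inside a dense block grow linearly with a small "growth rate" k (k = 32 in DenseNet-121). A minimal sketch, assuming the DenseNet-121 values for the first block (the helper name is mine):

```python
# Channel growth inside one dense block: layer n sees the initial
# feature maps plus k new channels from each of the n earlier layers.
k0 = 64  # channels entering the block (DenseNet-121, first block)
k = 32   # growth rate: new channels each layer contributes

def channels_into_layer(n):
    return k0 + n * k

# After the 6 layers of DenseNet-121's first dense block:
print(channels_into_layer(6))  # 256
```

Linear growth (rather than the doubling seen in VGG/ResNet stages) is why DenseNet stays so parameter-lean despite its dense connectivity.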
EfficientNet (Tan & Le, 2019)
EfficientNet used neural architecture search (NAS) to find an optimal base network, then compound scaling to uniformly scale depth, width, and resolution together. EfficientNet-B7 achieved 84.3% top-1 accuracy with 8.4× fewer parameters than the best existing models. It showed that how you scale matters as much as the architecture itself.
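Compound scaling ties the three axes to one coefficient φ. A sketch using the constants reported in the EfficientNet paper (α, β, γ were found by grid search subject to α·β²·γ² ≈ 2, so that each step of φ roughly doubles FLOPs; the `scale` helper is illustrative):

```python
# EfficientNet-style compound scaling: one knob (phi) scales depth,
# width, and input resolution together instead of tuning each by hand.
alpha, beta, gamma = 1.2, 1.1, 1.15  # paper's base coefficients

def scale(phi):
    depth = alpha ** phi        # multiplier on number of layers
    width = beta ** phi         # multiplier on channels per layer
    resolution = gamma ** phi   # multiplier on input image size
    return depth, width, resolution

# The constraint that makes each phi step ~2× the compute:
print(alpha * beta**2 * gamma**2)  # ≈ 1.92

d, w, r = scale(3)  # roughly the scaling regime of a mid-sized variant
```

The point of L70's "how you scale matters": given a fixed compute budget, spending it jointly on depth, width, and resolution beats scaling any single axis.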
Architecture Comparison
// ImageNet top-1 accuracy vs params
AlexNet (2012):     57% /  60M params
VGG-16 (2014):      71% / 138M params
GoogLeNet (2014):   75% /   5M params
ResNet-50 (2015):   76% /  26M params
DenseNet (2017):    77% /   8M params
EfficientB0 (2019): 77% / 5.3M params
EfficientB7 (2019): 84% /  66M params

// Trend: more accuracy per parameter
Key insight: The trend is clear: each generation achieves higher accuracy with fewer parameters. The innovations that drove this — skip connections, bottlenecks, compound scaling — are design principles, not just architectures.
Transfer Learning & Pretrained Models
Why you almost never train a CNN from scratch
The Transfer Learning Revolution
Training a CNN from scratch requires millions of labeled images and days of GPU time. Transfer learning sidesteps this: take a model pretrained on ImageNet (1.2M images, 1000 classes), freeze or fine-tune its convolutional layers, and replace the final classifier for your task. The pretrained features (edges, textures, parts) transfer remarkably well to medical imaging, satellite photos, manufacturing defect detection, and virtually any visual domain.
PyTorch Transfer Learning
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet-50
model = models.resnet50(weights='IMAGENET1K_V2')

# Replace the final layer for a 5-class task
model.fc = nn.Linear(2048, 5)

# Option A: fine-tune all layers (no freezing needed)
# Option B: freeze the backbone, train only the new classifier
for param in model.parameters():
    param.requires_grad = False
model.fc.requires_grad_(True)
Rule of thumb: With <1000 images, freeze the backbone and train only the classifier. With 1K–10K images, fine-tune the last few layers. With 10K+ images, fine-tune the entire network with a small learning rate.
Lessons & What’s Next
The design principles that endure
Enduring Design Principles
1. Depth matters — deeper networks learn richer representations (VGG, ResNet).
2. Skip connections are essential — they solve the degradation problem and enable gradient flow (ResNet).
3. Bottleneck layers save compute — 1×1 convolutions reduce dimensionality cheaply (GoogLeNet, ResNet).
4. Batch normalization stabilizes training — used in every modern architecture.
5. Transfer learning is the default — pretrained features generalize across domains.
The connection: CNNs dominated vision from 2012 to ~2020. Then Vision Transformers (ViT) showed that attention mechanisms could match or beat CNNs on images. But the principles — hierarchy, skip connections, bottlenecks — carry over. Next: Recurrent Neural Networks for sequence data.
The ImageNet Timeline
// ImageNet top-5 error rate
2010: 28.2% (hand-crafted features)
2011: 25.8% (hand-crafted features)
2012: 15.3% AlexNet ← deep learning era
2013: 11.7% ZFNet
2014:  6.7% GoogLeNet
2015:  3.6% ResNet ← surpasses human (5.1%)
2017:  2.3% SENet

// Human-level: ~5.1% (Russakovsky, 2015)