Ch 5 — CNN Architectures

AlexNet, VGG, GoogLeNet, ResNet — the models that defined modern computer vision
AlexNet — The Big Bang (2012)
The model that launched the deep learning revolution
The ImageNet Moment
In September 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge. It achieved a top-5 error of 15.3%, crushing the runner-up (26.2%) by over 10 percentage points. This wasn’t incremental — it was a paradigm shift. The computer vision community, which had relied on hand-crafted features (SIFT, HOG), pivoted overnight to deep learning.
Architecture
// AlexNet (Krizhevsky et al., 2012)
Input:   227×227×3
Conv1:   11×11, 96 filters, stride 4
Pool1:   3×3, stride 2
Conv2:   5×5, 256 filters
Pool2:   3×3, stride 2
Conv3-5: 3×3 convolutions
FC6-7:   4096 neurons each
Output:  1000 classes (softmax)

// Key innovations:
✓ ReLU (not sigmoid/tanh)
✓ GPU training (2× NVIDIA GTX 580)
✓ Dropout (0.5 in FC layers)
✓ Data augmentation

// 60M parameters, trained ~1 week on 2 GPUs
VGGNet — Depth with Simplicity (2014)
Proving that deeper is better with uniform 3×3 convolutions
The VGG Philosophy
Karen Simonyan and Andrew Zisserman at Oxford asked a simple question: what if we just make the network deeper using only 3×3 convolutions? VGG-16 (16 weight layers) and VGG-19 (19 layers) proved that depth matters. Two stacked 3×3 convolutions have the same receptive field as one 5×5 but with fewer parameters and more non-linearities. VGG achieved 7.3% top-5 error on ImageNet.
Key insight: VGG’s uniform architecture (all 3×3 convs, double channels after each pool) made it easy to understand and implement. It became the go-to feature extractor for transfer learning and remains widely used as a backbone today.
VGG-16 Architecture
// VGG-16 (Simonyan & Zisserman, 2014)
Block 1: 2× Conv(3×3, 64)  + Pool → 112×112
Block 2: 2× Conv(3×3, 128) + Pool → 56×56
Block 3: 3× Conv(3×3, 256) + Pool → 28×28
Block 4: 3× Conv(3×3, 512) + Pool → 14×14
Block 5: 3× Conv(3×3, 512) + Pool → 7×7
FC: 4096 → 4096 → 1000

// 138M parameters (most in FC layers)
// Two 3×3 = same receptive field as 5×5
// but 2×(3²) = 18 vs 5² = 25 params
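The per-channel-pair arithmetic above scales up once real channel counts are included. A quick sanity check in Python (`conv_params` is a throwaway helper defined here, not library code):

```python
# Weight count of a k×k convolution mapping c_in → c_out channels
# (bias terms ignored). Used to verify VGG's stacking argument.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

C = 256  # an illustrative channel count
two_3x3 = 2 * conv_params(3, C, C)  # two stacked 3×3 convs
one_5x5 = conv_params(5, C, C)      # one 5×5 conv, same receptive field

print(two_3x3)  # 1179648
print(one_5x5)  # 1638400
```

The stacked pair uses 18/25 ≈ 72% of the parameters of the single 5×5, and inserts an extra ReLU between the two convolutions for free.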
GoogLeNet / Inception (2014)
Going wider, not just deeper
The Inception Module
Google’s GoogLeNet (Szegedy et al., 2015) introduced the Inception module: instead of choosing one filter size, use multiple sizes in parallel (1×1, 3×3, 5×5) and concatenate their outputs. 1×1 convolutions act as “bottleneck” layers that reduce channel dimensions before expensive 3×3 and 5×5 operations. GoogLeNet achieved 6.7% top-5 error with only 5 million parameters — 27× fewer than VGG.
Inception Module
// Inception module (simplified)
Input ──┬── 1×1 conv ──────────┐
        ├── 1×1 → 3×3 conv ────┤
        ├── 1×1 → 5×5 conv ────┤ → Concat
        └── 3×3 pool → 1×1 ────┘

// 1×1 convolutions = "bottleneck"
// Reduce channels before expensive ops
// 256ch → 1×1(64) → 3×3(64) = cheap
// vs 256ch → 3×3(64) = expensive
Key insight: The 1×1 convolution is one of the most important ideas in CNN design. It mixes channels without changing spatial dimensions, acting as a learnable dimensionality reduction. It’s used in nearly every modern architecture.
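The "cheap vs expensive" claim in the panel above can be checked with the same kind of weight counting (`conv_params` is an illustrative helper, not library code):

```python
# Weight count of a k×k convolution mapping c_in → c_out channels (no bias).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Direct: 3×3 conv straight from 256 input channels to 64 outputs
direct = conv_params(3, 256, 64)                              # 147456

# Bottleneck: 1×1 reduces 256 → 64 channels, then 3×3 on the reduced maps
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64)  # 53248

print(direct, bottleneck)  # 147456 53248
```

The bottleneck path does the same 3×3 spatial processing with roughly 2.8× fewer weights, which is how GoogLeNet fits multiple parallel filter sizes into a 5M-parameter budget.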
ResNet — Skip Connections (2015)
The breakthrough that made 152-layer networks possible
The Degradation Problem
By 2015, researchers found a paradox: making networks deeper should improve accuracy, but in practice, very deep networks performed worse than shallower ones — even on training data. This wasn’t overfitting; it was an optimization problem. Kaiming He et al. at Microsoft Research solved it with residual learning: instead of learning a function H(x), learn the residual F(x) = H(x) - x. The output becomes F(x) + x, where the “+ x” is a skip connection (identity shortcut).
Key insight: Skip connections let gradients flow directly through the identity path, bypassing layers entirely. If a layer isn’t useful, the network can learn F(x) = 0, effectively skipping it. This makes deeper networks at least as good as shallower ones.
Residual Block
// Residual block
Input x ──┬── Conv → BN → ReLU ──┐
          │   Conv → BN          │
          │        ↓             │
          └── identity ──→ [ADD] → ReLU → Output

// output = F(x) + x
// F(x) = learned residual
// x = skip connection (identity)

// ResNet-152: 3.57% top-5 error
// Won ILSVRC 2015 (1st in 5 tracks)
// 8× deeper than VGG, lower complexity
ResNet Variants & Bottleneck Blocks
Scaling residual networks efficiently
The Bottleneck Design
For deeper ResNets (50, 101, 152 layers), He et al. used bottleneck blocks: a 1×1 conv reduces channels, a 3×3 conv processes the reduced representation, and another 1×1 conv restores channels. This 1×1 → 3×3 → 1×1 pattern is far more parameter-efficient than two 3×3 convolutions. ResNet-50 has 25.6M parameters — less than VGG-16’s 138M despite being 3× deeper.
ResNet Family
// ResNet variants
ResNet-18:  11.7M params, 69.8% top-1
ResNet-34:  21.8M params, 73.3% top-1
ResNet-50:  25.6M params, 76.1% top-1 ← sweet spot
ResNet-101: 44.5M params, 77.4% top-1
ResNet-152: 60.2M params, 78.3% top-1

// Bottleneck block (ResNet-50+)
1×1 conv (256→64)  // reduce
3×3 conv (64→64)   // process
1×1 conv (64→256)  // restore
+ skip connection
Rule of thumb: ResNet-50 is the most commonly used variant — it offers the best trade-off between accuracy and compute. It’s the default backbone for object detection (Faster R-CNN), segmentation (Mask R-CNN), and many transfer learning tasks.
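The bottleneck's efficiency is again just weight counting (`conv_params` is a throwaway helper for this sketch):

```python
# Weight count of a k×k convolution mapping c_in → c_out channels (no bias).
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

# Bottleneck block on 256-d features: 1×1 reduce → 3×3 → 1×1 restore
bottleneck = (conv_params(1, 256, 64)    # reduce
              + conv_params(3, 64, 64)   # process
              + conv_params(1, 64, 256)) # restore
print(bottleneck)  # 69632

# Naive alternative: two 3×3 convs operating directly on 256 channels
naive = 2 * conv_params(3, 256, 256)
print(naive)  # 1179648  (~17× more weights)
```

Roughly 17× fewer weights per block is what lets ResNet-50/101/152 go very deep while staying under VGG-16's parameter count.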
Post-ResNet Architectures
DenseNet, EfficientNet, and the search for efficiency
DenseNet (Huang et al., 2017)
DenseNet takes skip connections to the extreme: within each dense block, every layer connects to every subsequent layer. Layer N receives the feature maps of all preceding layers via concatenation. This encourages feature reuse, reduces parameters, and strengthens gradient flow. DenseNet-121 achieves accuracy comparable to ResNet-50 with far fewer parameters (roughly 8M vs 25.6M).
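Because every layer concatenates its inputs, channel counts inside a dense block grow linearly with a small "growth rate" k (k = 32 in DenseNet-121). A minimal sketch, assuming the DenseNet-121 values for the first block (the helper name is mine):

```python
# Channel growth inside one dense block: layer n sees the initial
# feature maps plus k new channels from each of the n earlier layers.
k0 = 64  # channels entering the block (DenseNet-121, first block)
k = 32   # growth rate: new channels each layer contributes

def channels_into_layer(n):
    return k0 + n * k

# After the 6 layers of DenseNet-121's first dense block:
print(channels_into_layer(6))  # 256
```

Linear growth (rather than the doubling seen in VGG/ResNet stages) is why DenseNet stays so parameter-lean despite its dense connectivity.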
EfficientNet (Tan & Le, 2019)
EfficientNet used neural architecture search (NAS) to find an optimal base network, then compound scaling to uniformly scale depth, width, and resolution together. EfficientNet-B7 achieved 84.3% top-1 accuracy with 8.4× fewer parameters than the best existing models. It showed that how you scale matters as much as the architecture itself.
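Compound scaling ties the three axes to one coefficient φ. A sketch using the constants reported in the EfficientNet paper (α, β, γ were found by grid search subject to α·β²·γ² ≈ 2, so that each step of φ roughly doubles FLOPs; the `scale` helper is illustrative):

```python
# EfficientNet-style compound scaling: one knob (phi) scales depth,
# width, and input resolution together instead of tuning each by hand.
alpha, beta, gamma = 1.2, 1.1, 1.15  # paper's base coefficients

def scale(phi):
    depth = alpha ** phi        # multiplier on number of layers
    width = beta ** phi         # multiplier on channels per layer
    resolution = gamma ** phi   # multiplier on input image size
    return depth, width, resolution

# The constraint that makes each phi step ~2× the compute:
print(alpha * beta**2 * gamma**2)  # ≈ 1.92

d, w, r = scale(3)  # roughly the scaling regime of a mid-sized variant
```

The point of L70's "how you scale matters": given a fixed compute budget, spending it jointly on depth, width, and resolution beats scaling any single axis.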
Architecture Comparison
// ImageNet top-1 accuracy vs params
AlexNet (2012):     57% /  60M params
VGG-16 (2014):      71% / 138M params
GoogLeNet (2014):   75% /   5M params
ResNet-50 (2015):   76% /  26M params
DenseNet (2017):    77% /   8M params
EfficientB0 (2019): 77% / 5.3M params
EfficientB7 (2019): 84% /  66M params

// Trend: more accuracy per parameter
Key insight: The trend is clear: each generation achieves higher accuracy with fewer parameters. The innovations that drove this — skip connections, bottlenecks, compound scaling — are design principles, not just architectures.
Transfer Learning & Pretrained Models
Why you almost never train a CNN from scratch
The Transfer Learning Revolution
Training a CNN from scratch requires millions of labeled images and days of GPU time. Transfer learning sidesteps this: take a model pretrained on ImageNet (1.2M images, 1000 classes), freeze or fine-tune its convolutional layers, and replace the final classifier for your task. The pretrained features (edges, textures, parts) transfer remarkably well to medical imaging, satellite photos, manufacturing defect detection, and virtually any visual domain.
PyTorch Transfer Learning
import torch.nn as nn
import torchvision.models as models

# Load pretrained ResNet-50
model = models.resnet50(weights='IMAGENET1K_V2')

# Replace the final layer for a 5-class task
model.fc = nn.Linear(2048, 5)

# Option A: fine-tune all layers (no freezing needed)
# Option B: freeze the backbone, train only the new classifier
for param in model.parameters():
    param.requires_grad = False
model.fc.requires_grad_(True)
Rule of thumb: With <1000 images, freeze the backbone and train only the classifier. With 1K–10K images, fine-tune the last few layers. With 10K+ images, fine-tune the entire network with a small learning rate.
Lessons & What’s Next
The design principles that endure
Enduring Design Principles
1. Depth matters — deeper networks learn richer representations (VGG, ResNet).
2. Skip connections are essential — they solve the degradation problem and enable gradient flow (ResNet).
3. Bottleneck layers save compute — 1×1 convolutions reduce dimensionality cheaply (GoogLeNet, ResNet).
4. Batch normalization stabilizes training — used in every modern architecture.
5. Transfer learning is the default — pretrained features generalize across domains.
The connection: CNNs dominated vision from 2012 to ~2020. Then Vision Transformers (ViT) showed that attention mechanisms could match or beat CNNs on images. But the principles — hierarchy, skip connections, bottlenecks — carry over. Next: Recurrent Neural Networks for sequence data.
The ImageNet Timeline
// ImageNet top-5 error rate
2010: 28.2% (hand-crafted features)
2011: 25.8% (hand-crafted features)
2012: 15.3% AlexNet ← deep learning era
2013: 11.7% ZFNet
2014:  6.7% GoogLeNet
2015:  3.6% ResNet ← surpasses human (5.1%)
2017:  2.3% SENet

// Human-level: ~5.1% (Russakovsky, 2015)