Ch 4 — Convolutional Neural Networks

Convolutions, filters, pooling, stride, padding, and feature maps
High Level
Input → Conv → Activate → Pool → Stack → Classify
Why MLPs Fail at Images
The problem that motivated convolutions
The Curse of Pixels
A 224×224 RGB image has 150,528 input values. A fully connected layer with 1,000 hidden neurons would need 150 million weights — just for the first layer. This is wildly impractical: too many parameters to train, too much memory, and no spatial awareness. An MLP treats pixel (0,0) and pixel (223,223) identically — it has no concept of “nearby” or “pattern.” Shifting an object by one pixel creates an entirely different input vector.
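The arithmetic is worth checking directly (a quick sketch; the 1,000-neuron hidden layer is the hypothetical from the text):

```python
# Weight count for a fully connected first layer on a 224×224 RGB image
inputs = 224 * 224 * 3   # 150,528 input values
hidden = 1_000           # hypothetical hidden layer size from the text
weights = inputs * hidden
print(inputs)   # 150528
print(weights)  # 150528000 — roughly 150 million, before biases
```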
Three Key Principles of CNNs
// 1. Local connectivity
//    Each neuron sees only a small patch
//    (not the entire image)
// 2. Weight sharing
//    Same filter applied across all positions
//    → translation equivariance
// 3. Spatial hierarchy
//    Early layers: edges, textures
//    Middle layers: parts (eyes, wheels)
//    Late layers: objects (faces, cars)
Key insight: CNNs exploit the structure of images: nearby pixels are correlated, the same pattern can appear anywhere, and complex features are built from simpler ones. These three priors reduce parameters by orders of magnitude.
The Convolution Operation
Sliding a filter across an image
How Convolution Works
A filter (or kernel) is a small matrix of learnable weights, typically 3×3 or 5×5. It slides across the input image, computing a dot product at each position between the filter weights and the overlapping image patch. The result is a feature map — a 2D grid showing where and how strongly the filter’s pattern appears in the image. A vertical edge filter, for example, produces high values wherever vertical edges exist.
Convolution Math
// 3×3 filter on a 5×5 input
Input (5×5):     Filter (3×3):
1 0 1 0 1        1  0 -1
0 1 0 1 0        1  0 -1
1 0 1 0 1        1  0 -1
0 1 0 1 0
1 0 1 0 1

// At position (0,0):
output[0,0] = 1·1 + 0·0 + 1·(-1)
            + 0·1 + 1·0 + 0·(-1)
            + 1·1 + 0·0 + 1·(-1) = 0
Key insight: A 3×3 filter has only 9 learnable parameters, yet it can detect a specific pattern anywhere in the image. This weight sharing is what makes CNNs so parameter-efficient compared to MLPs.
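The sliding-window computation can be written out in a few lines of NumPy. A minimal sketch: strictly speaking, CNN layers compute cross-correlation (no kernel flip), which is what this does; the function name `conv2d_valid` is ours:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding, stride 1."""
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the kernel and the overlapping patch
            out[i, j] = np.sum(image[i:i+K, j:j+K] * kernel)
    return out

image = np.array([[1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # vertical edge detector

fmap = conv2d_valid(image, kernel)
print(fmap.shape)   # (3, 3)
print(fmap[0, 0])   # 0.0 — matches the worked example
```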
Stride & Padding
Controlling output size and border behavior
Stride
Stride controls how many pixels the filter moves between positions. Stride 1 means the filter moves one pixel at a time (default). Stride 2 means it skips every other position, halving the output dimensions. Larger strides reduce computation and spatial size but lose fine-grained detail. Stride-2 convolutions are often used instead of pooling in modern architectures.
Padding
Padding adds zeros around the input border. Without padding (“valid”), a 3×3 filter on a 5×5 input produces a 3×3 output — the image shrinks. With “same” padding (1 pixel of zeros), the output stays 5×5. Padding preserves spatial dimensions and ensures edge pixels get equal treatment.
Output Size Formula
// Output dimension formula
O = ⌊(W - K + 2P) / S⌋ + 1
// W = input size, K = kernel size
// P = padding, S = stride
// (floor the division when it's not exact)

// Example: 32×32 input, 3×3 kernel
stride=1, pad=0: (32-3+0)/1+1 = 30
stride=1, pad=1: (32-3+2)/1+1 = 32   ← same
stride=2, pad=1: ⌊(32-3+2)/2⌋+1 = 16 ← halved
Rule of thumb: For a 3×3 kernel, padding=1 with stride=1 preserves spatial dimensions. This “same” convolution is the most common configuration in modern architectures like ResNet.
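The formula is one line of code. A small sketch (the helper name `conv_output_size` is ours) that reproduces the three examples:

```python
def conv_output_size(w, k, p=0, s=1):
    """O = floor((W - K + 2P) / S) + 1"""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(32, 3, p=0, s=1))  # 30 — "valid", image shrinks
print(conv_output_size(32, 3, p=1, s=1))  # 32 — "same" convolution
print(conv_output_size(32, 3, p=1, s=2))  # 16 — stride 2 halves it
```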
Pooling Layers
Downsampling for invariance and efficiency
What Pooling Does
Pooling reduces the spatial dimensions of feature maps, making the network more efficient and providing a degree of translation invariance. Max pooling (most common) takes the maximum value in each window — keeping the strongest activation regardless of its exact position. Average pooling takes the mean. A 2×2 max pool with stride 2 halves both width and height, reducing the number of values by 75%.
Max Pooling Example
// 2×2 max pooling, stride 2
Input (4×4):       Output (2×2):
1 3 | 2 1
5 2 | 0 2     →    5 2
---------          8 6
4 8 | 1 6
3 1 | 5 4

// Each 2×2 block → its maximum value
// No learnable parameters
Key insight: Global Average Pooling (GAP), introduced in Network-in-Network (Lin et al., 2014), averages each feature map to a single value. It replaced the large fully connected layers at the end of CNNs, dramatically reducing parameters. ResNet and most modern CNNs use GAP.
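Both poolings are a few lines in NumPy. A minimal sketch (assumes even height and width; the helper name `max_pool_2x2` is ours) that reproduces the 4×4 example:

```python
import numpy as np

def max_pool_2x2(x):
    """2×2 max pooling, stride 2: group rows and columns in pairs,
    then take the max of each 2×2 block."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [5, 2, 0, 2],
              [4, 8, 1, 6],
              [3, 1, 5, 4]])
print(max_pool_2x2(x))
# [[5 2]
#  [8 6]]

# Global Average Pooling: the whole map collapses to one value
print(x.mean())  # 3.0
```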
Feature Maps & Channels
How CNNs build hierarchical representations
Multiple Filters = Multiple Feature Maps
Each convolutional layer applies multiple filters, each producing its own feature map. If a layer has 64 filters, it outputs 64 feature maps (channels). The next layer’s filters operate across all input channels simultaneously — a 3×3 filter on 64-channel input is actually a 3×3×64 volume with 576 weights. This is how deeper layers combine low-level features into higher-level patterns.
Parameter Count
// Conv layer parameter count
params = K × K × C_in × C_out + C_out
// K×K filter size, C_in/C_out channels,
// plus one bias per output channel

// Example: 3×3 conv, 64 → 128 channels
params = 3 × 3 × 64 × 128 + 128 = 73,856

// vs. equivalent fully connected layer
// on 32×32×64 input → 32×32×128 output
params = 65,536 × 131,072 ≈ 8.6 billion // !
Key insight: A typical CNN progressively increases channels (3 → 64 → 128 → 256 → 512) while decreasing spatial dimensions (224 → 112 → 56 → 28 → 14). This trades spatial resolution for feature richness.
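The parameter count above can be verified directly (a small sketch; the helper name `conv_params` is ours):

```python
def conv_params(k, c_in, c_out):
    """K·K·C_in·C_out weights plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

print(conv_params(3, 64, 128))  # 73856

# The fully connected equivalent: every input value connected
# to every output value on a 32×32×64 → 32×32×128 mapping
fc = (32 * 32 * 64) * (32 * 32 * 128)
print(fc)  # 8589934592 — about 8.6 billion weights
```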
The Hierarchical Feature Pyramid
From edges to objects, layer by layer
What Each Layer Learns
Visualization research (Zeiler & Fergus, 2014) revealed what CNN layers actually learn. Layer 1: edges, color gradients, simple textures. Layer 2: corners, curves, repeated patterns. Layer 3: parts of objects (eyes, wheels, windows). Layer 4–5: whole objects and scenes. This hierarchy emerges automatically from training — nobody programs “detect edges.” The network discovers that edges are useful building blocks for recognizing objects.
Key insight: This hierarchical feature learning is why CNNs transfer so well. Features learned on ImageNet (edges, textures, parts) are useful for medical imaging, satellite photos, and any visual task. This is the foundation of transfer learning.
Receptive Field Growth
// Receptive field: how much input
// each neuron "sees"
Layer 1 (3×3 conv): sees 3×3 pixels
Layer 2 (3×3 conv): sees 5×5 pixels
Layer 3 (3×3 conv): sees 7×7 pixels
...
Layer N: sees (2N+1) × (2N+1) pixels

// With pooling, receptive field grows
// even faster — deep neurons see
// large regions of the original image
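The growth pattern follows a standard recurrence: each layer adds (K − 1) times the cumulative stride. A sketch (the names `receptive_field` and `jump` are ours):

```python
def receptive_field(layers):
    """Receptive field of the last layer in a stack.
    `layers` is a list of (kernel_size, stride) pairs.
    Recurrence: rf += (k - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the view
        jump *= s             # strides compound the widening
    return rf

# Three stacked 3×3 convs, stride 1 → 7×7, matching (2N+1)
print(receptive_field([(3, 1)] * 3))                  # 7
# Insert a 2×2 stride-2 pool and the field grows faster
print(receptive_field([(3, 1), (2, 2), (3, 1)]))      # 8
```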
LeCun’s LeNet (1989–1998)
The CNN that started it all
The Pioneer
In 1989, Yann LeCun at AT&T Bell Labs applied backpropagation to train a CNN for handwritten digit recognition. The refined version, LeNet-5 (1998), processed 32×32 grayscale images through 2 convolutional layers, 2 subsampling (pooling) layers, and 3 fully connected layers. With only ~60,000 parameters, it achieved ~99% accuracy on MNIST digits and was deployed by the US Postal Service to read ZIP codes on mail. LeNet proved that CNNs could solve real-world problems.
LeNet-5 Architecture
// LeNet-5 (LeCun et al., 1998)
Input:  32×32×1 (grayscale)
Conv1:  5×5, 6 filters  → 28×28×6
Pool1:  2×2, stride 2   → 14×14×6
Conv2:  5×5, 16 filters → 10×10×16
Pool2:  2×2, stride 2   → 5×5×16
FC1:    120 neurons
FC2:    84 neurons
Output: 10 classes (digits 0-9)

// Total: ~60,000 parameters
// Activation: tanh (pre-ReLU era)
Why it matters: LeNet established the Conv → Pool → Conv → Pool → FC pattern that dominated CNN design for 20 years. Every architecture from AlexNet to ResNet is a descendant of this template.
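The architecture maps almost line for line onto modern PyTorch. A sketch, not a faithful reproduction: it keeps the original tanh activations but substitutes max pooling for LeNet's trainable subsampling and a plain linear output for its RBF layer, so the parameter count only approximates the ~60,000 quoted above:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.Tanh(),   # 32×32×1 → 28×28×6
            nn.MaxPool2d(2),                 # → 14×14×6
            nn.Conv2d(6, 16, 5), nn.Tanh(),  # → 10×10×16
            nn.MaxPool2d(2),                 # → 5×5×16
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = LeNet5()
print(sum(p.numel() for p in model.parameters()))  # 61706
print(model(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 10])
```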
Building a CNN in PyTorch
Putting the pieces together in code
A Modern CNN
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
What's Next
This chapter covered the mechanics of CNNs: convolutions, pooling, stride, padding, and feature hierarchies. The next chapter explores the landmark CNN architectures — AlexNet, VGG, GoogLeNet, and ResNet — that pushed accuracy to superhuman levels and defined the modern era of computer vision.
The connection: CNNs exploit three priors about images: locality, translation equivariance, and compositionality. These same principles appear in other domains — 1D convolutions for audio, graph convolutions for molecules, and the attention mechanism that eventually superseded convolutions for many tasks.