Ch 4 — Convolutional Neural Networks

Convolutions, filters, pooling, stride, padding, and feature maps
High Level
Input → Conv → Activate → Pool → Stack → Classify
Why MLPs Fail at Images
The problem that motivated convolutions
The Curse of Pixels
A 224×224 RGB image has 150,528 input values. A fully connected layer with 1,000 hidden neurons would need 150 million weights — just for the first layer. This is wildly impractical: too many parameters to train, too much memory, and no spatial awareness. An MLP treats pixel (0,0) and pixel (223,223) identically — it has no concept of “nearby” or “pattern.” Shifting an object by one pixel creates an entirely different input vector.
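The arithmetic is worth checking directly (a quick sketch; the 1,000-neuron hidden layer is the hypothetical from the text):

```python
# Weight count for a fully connected first layer on a 224×224 RGB image
inputs = 224 * 224 * 3   # 150,528 input values
hidden = 1_000           # hypothetical hidden layer size from the text
weights = inputs * hidden
print(inputs)   # 150528
print(weights)  # 150528000 — roughly 150 million, before biases
```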
Three Key Principles of CNNs
// 1. Local connectivity
//    Each neuron sees only a small patch
//    (not the entire image)
// 2. Weight sharing
//    Same filter applied across all positions
//    → translation equivariance
// 3. Spatial hierarchy
//    Early layers: edges, textures
//    Middle layers: parts (eyes, wheels)
//    Late layers: objects (faces, cars)
Key insight: CNNs exploit the structure of images: nearby pixels are correlated, the same pattern can appear anywhere, and complex features are built from simpler ones. These three priors reduce parameters by orders of magnitude.
The Convolution Operation
Sliding a filter across an image
How Convolution Works
A filter (or kernel) is a small matrix of learnable weights, typically 3×3 or 5×5. It slides across the input image, computing a dot product at each position between the filter weights and the overlapping image patch. The result is a feature map — a 2D grid showing where and how strongly the filter’s pattern appears in the image. A vertical edge filter, for example, produces high values wherever vertical edges exist.
Convolution Math
// 3×3 filter on a 5×5 input
Input (5×5):     Filter (3×3):
1 0 1 0 1        1  0 -1
0 1 0 1 0        1  0 -1
1 0 1 0 1        1  0 -1
0 1 0 1 0
1 0 1 0 1

// At position (0,0):
output[0,0] = 1·1 + 0·0 + 1·(-1)
            + 0·1 + 1·0 + 0·(-1)
            + 1·1 + 0·0 + 1·(-1) = 0
Key insight: A 3×3 filter has only 9 learnable parameters, yet it can detect a specific pattern anywhere in the image. This weight sharing is what makes CNNs so parameter-efficient compared to MLPs.
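The sliding-window computation can be written out in a few lines of NumPy. A minimal sketch: strictly speaking, CNN layers compute cross-correlation (no kernel flip), which is what this does; the function name `conv2d_valid` is ours:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with no padding, stride 1."""
    H, W = image.shape
    K = kernel.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the kernel and the overlapping patch
            out[i, j] = np.sum(image[i:i+K, j:j+K] * kernel)
    return out

image = np.array([[1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)  # vertical edge detector

fmap = conv2d_valid(image, kernel)
print(fmap.shape)   # (3, 3)
print(fmap[0, 0])   # 0.0 — matches the worked example
```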
Stride & Padding
Controlling output size and border behavior
Stride
Stride controls how many pixels the filter moves between positions. Stride 1 means the filter moves one pixel at a time (default). Stride 2 means it skips every other position, halving the output dimensions. Larger strides reduce computation and spatial size but lose fine-grained detail. Stride-2 convolutions are often used instead of pooling in modern architectures.
Padding
Padding adds zeros around the input border. Without padding (“valid”), a 3×3 filter on a 5×5 input produces a 3×3 output — the image shrinks. With “same” padding (1 pixel of zeros), the output stays 5×5. Padding preserves spatial dimensions and ensures edge pixels get equal treatment.
Output Size Formula
// Output dimension formula
O = ⌊(W - K + 2P) / S⌋ + 1
// W = input size, K = kernel size
// P = padding, S = stride
// (floor the division when it's not exact)

// Example: 32×32 input, 3×3 kernel
stride=1, pad=0: (32-3+0)/1+1 = 30
stride=1, pad=1: (32-3+2)/1+1 = 32   ← same
stride=2, pad=1: ⌊(32-3+2)/2⌋+1 = 16 ← halved
Rule of thumb: For a 3×3 kernel, padding=1 with stride=1 preserves spatial dimensions. This “same” convolution is the most common configuration in modern architectures like ResNet.
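The formula is one line of code. A small sketch (the helper name `conv_output_size` is ours) that reproduces the three examples:

```python
def conv_output_size(w, k, p=0, s=1):
    """O = floor((W - K + 2P) / S) + 1"""
    return (w - k + 2 * p) // s + 1

print(conv_output_size(32, 3, p=0, s=1))  # 30 — "valid", image shrinks
print(conv_output_size(32, 3, p=1, s=1))  # 32 — "same" convolution
print(conv_output_size(32, 3, p=1, s=2))  # 16 — stride 2 halves it
```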
Pooling Layers
Downsampling for invariance and efficiency
What Pooling Does
Pooling reduces the spatial dimensions of feature maps, making the network more efficient and providing a degree of translation invariance. Max pooling (most common) takes the maximum value in each window — keeping the strongest activation regardless of its exact position. Average pooling takes the mean. A 2×2 max pool with stride 2 halves both width and height, reducing the number of values by 75%.
Max Pooling Example
// 2×2 max pooling, stride 2
Input (4×4):       Output (2×2):
1 3 | 2 1
5 2 | 0 2     →    5 2
---------          8 6
4 8 | 1 6
3 1 | 5 4

// Each 2×2 block → its maximum value
// No learnable parameters
Key insight: Global Average Pooling (GAP), introduced in Network-in-Network (Lin et al., 2014), averages each feature map to a single value. It replaced the large fully connected layers at the end of CNNs, dramatically reducing parameters. ResNet and most modern CNNs use GAP.
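Both poolings are a few lines in NumPy. A minimal sketch (assumes even height and width; the helper name `max_pool_2x2` is ours) that reproduces the 4×4 example:

```python
import numpy as np

def max_pool_2x2(x):
    """2×2 max pooling, stride 2: group rows and columns in pairs,
    then take the max of each 2×2 block."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [5, 2, 0, 2],
              [4, 8, 1, 6],
              [3, 1, 5, 4]])
print(max_pool_2x2(x))
# [[5 2]
#  [8 6]]

# Global Average Pooling: the whole map collapses to one value
print(x.mean())  # 3.0
```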
Feature Maps & Channels
How CNNs build hierarchical representations
Multiple Filters = Multiple Feature Maps
Each convolutional layer applies multiple filters, each producing its own feature map. If a layer has 64 filters, it outputs 64 feature maps (channels). The next layer’s filters operate across all input channels simultaneously — a 3×3 filter on 64-channel input is actually a 3×3×64 volume with 576 weights. This is how deeper layers combine low-level features into higher-level patterns.
Parameter Count
// Conv layer parameter count
params = K × K × C_in × C_out + C_out
// K×K filter size, C_in/C_out channels,
// plus one bias per output channel

// Example: 3×3 conv, 64 → 128 channels
params = 3 × 3 × 64 × 128 + 128 = 73,856

// vs. equivalent fully connected layer
// on 32×32×64 input → 32×32×128 output
params = 65,536 × 131,072 ≈ 8.6 billion // !
Key insight: A typical CNN progressively increases channels (3 → 64 → 128 → 256 → 512) while decreasing spatial dimensions (224 → 112 → 56 → 28 → 14). This trades spatial resolution for feature richness.
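The parameter count above can be verified directly (a small sketch; the helper name `conv_params` is ours):

```python
def conv_params(k, c_in, c_out):
    """K·K·C_in·C_out weights plus one bias per output channel."""
    return k * k * c_in * c_out + c_out

print(conv_params(3, 64, 128))  # 73856

# The fully connected equivalent: every input value connected
# to every output value on a 32×32×64 → 32×32×128 mapping
fc = (32 * 32 * 64) * (32 * 32 * 128)
print(fc)  # 8589934592 — about 8.6 billion weights
```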
The Hierarchical Feature Pyramid
From edges to objects, layer by layer
What Each Layer Learns
Visualization research (Zeiler & Fergus, 2014) revealed what CNN layers actually learn. Layer 1: edges, color gradients, simple textures. Layer 2: corners, curves, repeated patterns. Layer 3: parts of objects (eyes, wheels, windows). Layer 4–5: whole objects and scenes. This hierarchy emerges automatically from training — nobody programs “detect edges.” The network discovers that edges are useful building blocks for recognizing objects.
Key insight: This hierarchical feature learning is why CNNs transfer so well. Features learned on ImageNet (edges, textures, parts) are useful for medical imaging, satellite photos, and any visual task. This is the foundation of transfer learning.
Receptive Field Growth
// Receptive field: how much input
// each neuron "sees"
Layer 1 (3×3 conv): sees 3×3 pixels
Layer 2 (3×3 conv): sees 5×5 pixels
Layer 3 (3×3 conv): sees 7×7 pixels
...
Layer N: sees (2N+1) × (2N+1) pixels

// With pooling, receptive field grows
// even faster — deep neurons see
// large regions of the original image
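The growth pattern follows a standard recurrence: each layer adds (K − 1) times the cumulative stride. A sketch (the names `receptive_field` and `jump` are ours):

```python
def receptive_field(layers):
    """Receptive field of the last layer in a stack.
    `layers` is a list of (kernel_size, stride) pairs.
    Recurrence: rf += (k - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the view
        jump *= s             # strides compound the widening
    return rf

# Three stacked 3×3 convs, stride 1 → 7×7, matching (2N+1)
print(receptive_field([(3, 1)] * 3))                  # 7
# Insert a 2×2 stride-2 pool and the field grows faster
print(receptive_field([(3, 1), (2, 2), (3, 1)]))      # 8
```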
LeCun’s LeNet (1989–1998)
The CNN that started it all
The Pioneer
In 1989, Yann LeCun at AT&T Bell Labs applied backpropagation to train a CNN for handwritten digit recognition. The refined version, LeNet-5 (1998), processed 32×32 grayscale images through 2 convolutional layers, 2 subsampling (pooling) layers, and 3 fully connected layers. With only ~60,000 parameters, it achieved ~99% accuracy on MNIST digits and was deployed by the US Postal Service to read ZIP codes on mail. LeNet proved that CNNs could solve real-world problems.
LeNet-5 Architecture
// LeNet-5 (LeCun et al., 1998)
Input:  32×32×1 (grayscale)
Conv1:  5×5, 6 filters  → 28×28×6
Pool1:  2×2, stride 2   → 14×14×6
Conv2:  5×5, 16 filters → 10×10×16
Pool2:  2×2, stride 2   → 5×5×16
FC1:    120 neurons
FC2:    84 neurons
Output: 10 classes (digits 0-9)

// Total: ~60,000 parameters
// Activation: tanh (pre-ReLU era)
Why it matters: LeNet established the Conv → Pool → Conv → Pool → FC pattern that dominated CNN design for 20 years. Every architecture from AlexNet to ResNet is a descendant of this template.
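The architecture maps almost line for line onto modern PyTorch. A sketch, not a faithful reproduction: it keeps the original tanh activations but substitutes max pooling for LeNet's trainable subsampling and a plain linear output for its RBF layer, so the parameter count only approximates the ~60,000 quoted above:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.Tanh(),   # 32×32×1 → 28×28×6
            nn.MaxPool2d(2),                 # → 14×14×6
            nn.Conv2d(6, 16, 5), nn.Tanh(),  # → 10×10×16
            nn.MaxPool2d(2),                 # → 5×5×16
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = LeNet5()
print(sum(p.numel() for p in model.parameters()))  # 61706
print(model(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 10])
```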
Building a CNN in PyTorch
Putting the pieces together in code
A Modern CNN
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, 10)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)
What's Next
This chapter covered the mechanics of CNNs: convolutions, pooling, stride, padding, and feature hierarchies. The next chapter explores the landmark CNN architectures — AlexNet, VGG, GoogLeNet, and ResNet — that pushed accuracy to superhuman levels and defined the modern era of computer vision.
The connection: CNNs exploit three priors about images: locality, translation equivariance, and compositionality. These same principles appear in other domains — 1D convolutions for audio, graph convolutions for molecules, and the attention mechanism that eventually superseded convolutions for many tasks.