Ch 7 — Convolutional Neural Networks

How CNNs see — from pixels to features to classification
High Level
Input → Convolve → Pool → Deep → Flatten → Classify
Why CNNs?
The problem with using MLPs for images
The Problem
A 224×224 RGB image has 150,528 input values (50,176 pixels × 3 color channels). A fully connected layer with 1,000 neurons would need about 150 million weights — just for the first layer. This is wasteful because images have spatial structure: nearby pixels are related, and the same pattern (an edge, a texture) can appear anywhere in the image.
The CNN Insight
Instead of connecting every pixel to every neuron, use small filters that slide across the image, detecting local patterns. The same filter is reused everywhere (weight sharing), dramatically reducing parameters. A 3×3 filter has only 9 weights but can detect edges anywhere in the image.
MLP on 224×224 Image
First layer: 224×224×3 × 1000 = 150M parameters. No spatial awareness. A cat in the corner looks completely different from a cat in the center.
CNN on 224×224 Image
First layer: 64 filters × 3×3×3 = 1,728 parameters. Spatially aware. Same filter detects edges everywhere. Translation invariant.
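To make the comparison concrete, here is a small sketch of the parameter arithmetic from the two boxes above (the function names are illustrative, not from any library):

```python
# Hypothetical helpers sketching the first-layer parameter counts above
def mlp_first_layer_params(h, w, c, neurons):
    """Fully connected: every input value connects to every neuron."""
    return h * w * c * neurons

def conv_first_layer_params(k, c, filters):
    """Convolutional: each k×k×c filter is shared across all positions."""
    return k * k * c * filters

mlp = mlp_first_layer_params(224, 224, 3, 1000)  # 150,528,000
cnn = conv_first_layer_params(3, 3, 64)          # 1,728
print(mlp, cnn, mlp // cnn)  # the CNN uses ~87,000× fewer weights
```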
Biological inspiration: CNNs were inspired by Hubel and Wiesel’s discovery (published in 1962, later awarded the Nobel Prize) that neurons in the visual cortex respond to specific local patterns (edges, orientations) in small receptive fields, not the entire visual field.
The Convolution Operation
Sliding a small filter across the image to detect patterns
How It Works
A small kernel (typically 3×3 or 5×5) slides across the input image. At each position, it computes the dot product between the kernel weights and the overlapping image patch. The result is a feature map — a new image showing where the pattern was detected.
# 3×3 edge-detection kernel example
Kernel:
  [-1, -1, -1]
  [-1,  8, -1]
  [-1, -1, -1]

# Slide across image, compute dot product at each position:
#   output[i,j] = ∑ kernel × image_patch[i,j]
# High output = strong edge detected
# Low output  = smooth region (no edge)
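The sliding-window computation can be written out in plain NumPy — a minimal, unoptimized sketch assuming a single-channel input, no padding, and stride 1:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid convolution (no padding, stride 1), single channel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # dot product of the kernel and the overlapping image patch
            out[i, j] = np.sum(kernel * image[i:i+kh, j:j+kw])
    return out

edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

# A vertical edge: left half dark (0), right half bright (1)
img = np.zeros((5, 6))
img[:, 3:] = 1.0
print(convolve2d(img, edge_kernel))  # nonzero only near the edge
```

Flat regions produce zero output (the kernel weights sum to zero), while positions straddling the dark/bright boundary produce strong responses.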
Key Concepts
Stride: How many pixels the kernel moves each step. Stride 1 = move one pixel. Stride 2 = skip every other pixel (halves output size).

Padding: Add zeros around the image border so the output has the same size as the input. Without padding, each convolution shrinks the image.

Multiple filters: Each filter detects a different pattern. 64 filters produce 64 feature maps. The network learns which filters are useful during training.
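Stride, padding, and kernel size combine into a standard output-size formula. A small sketch (the helper name is illustrative):

```python
# Standard conv output-size formula:
#   out = floor((n + 2*padding - kernel) / stride) + 1
def conv_output_size(n, kernel, stride=1, padding=0):
    return (n + 2 * padding - kernel) // stride + 1

print(conv_output_size(224, 3, stride=1, padding=1))  # 224 ("same" padding)
print(conv_output_size(224, 3, stride=1, padding=0))  # 222 (shrinks by 2)
print(conv_output_size(224, 3, stride=2, padding=1))  # 112 (stride 2 halves)
```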
Weight sharing is the key: One 3×3 filter has 9 weights but is applied at every position in the image. A 224×224 image has ~50,000 positions. The same 9 weights detect the same pattern everywhere — this is why CNNs are translation invariant.
Pooling
Downsampling feature maps to reduce computation
Max Pooling
Divide the feature map into non-overlapping patches (typically 2×2) and keep only the maximum value in each patch. This halves the spatial dimensions, reduces computation by 4x, and makes the network more robust to small translations.
# Max pooling 2×2 example
Input:          Output:
[1, 3, 2, 4]
[5, 6, 1, 8]  → [6, 8]
[3, 2, 7, 1]
[9, 4, 8, 3]  → [9, 8]

# Takes the max from each 2×2 block
# 4×4 → 2×2 (halves each dimension)
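The same example can be reproduced in a few lines of NumPy — a sketch assuming the input height and width are even:

```python
import numpy as np

def max_pool_2x2(x):
    """2×2 max pooling over non-overlapping blocks (h, w must be even)."""
    h, w = x.shape
    # Reshape into 2×2 blocks, then take the max within each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 8],
              [3, 2, 7, 1],
              [9, 4, 8, 3]])
print(max_pool_2x2(x))  # [[6 8] [9 8]]
```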
Why Pool?
1. Reduce computation: Fewer pixels to process in subsequent layers
2. Translation invariance: Small shifts in input don’t change the max value
3. Larger receptive field: After pooling, each neuron “sees” a larger area of the original image
4. Prevent overfitting: Fewer parameters in later layers
Modern trend: Many modern architectures replace pooling with strided convolutions (a stride-2 convolution achieves similar downsampling but with learnable parameters). ResNets and EfficientNets use strided convolutions for downsampling. Some architectures, like ViT, eliminate pooling entirely.
Hierarchical Feature Learning
How deeper layers learn increasingly abstract features
Layer-by-Layer Abstraction
Each convolutional layer builds on the previous one, creating a hierarchy of features from simple to complex. This is the fundamental power of deep CNNs — they automatically learn the right features for the task.
# What each layer learns (image classification)
Layer 1:  Edges, corners, color gradients
          Simple, low-level patterns:  | / \ — + L T
Layer 2:  Textures, simple shapes
          Combinations of edges: circles, grids, stripes
Layer 3:  Parts of objects
          Combinations of textures/shapes: eyes, wheels, windows, fur
Layer 4+: Whole objects, scenes
          Combinations of parts: faces, cars, dogs, buildings
Receptive Field Growth
Each layer’s neurons “see” a larger area of the original image. Layer 1 neurons see a 3×3 patch. Layer 2 neurons see a 5×5 patch (through the layer 1 neurons they connect to). By layer 10, neurons may see the entire image. This is how local patterns combine into global understanding.
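This growth can be computed with the standard receptive-field recurrence (stated here as an assumption; for stride-1 3×3 convolutions it simply adds 2 per layer):

```python
# Receptive field of a stack of conv layers: rf grows by (k-1)*jump
# per layer, where jump is the product of strides so far.
def receptive_fields(num_layers, kernel=3, stride=1):
    rf, jump, fields = 1, 1, []
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
        fields.append(rf)
    return fields

print(receptive_fields(5))  # [3, 5, 7, 9, 11] — matches the 3×3 → 5×5 growth
```

With pooling or strided layers in between, `jump` grows multiplicatively, which is why deep networks reach whole-image receptive fields quickly.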
Transfer learning works because of this hierarchy. Early layers learn universal features (edges, textures) that transfer across tasks. A CNN trained on ImageNet can be fine-tuned for medical imaging by replacing only the last few layers. The edge and texture detectors are reusable.
CNN Architecture
The standard pattern: Conv → ReLU → Pool → repeat → Flatten → Dense
# Classic CNN architecture
Input:   224 × 224 × 3 (RGB image)
Conv1:   64 filters, 3×3, ReLU
Conv2:   64 filters, 3×3, ReLU
Pool:    2×2 max pool → 112×112×64
Conv3:   128 filters, 3×3, ReLU
Conv4:   128 filters, 3×3, ReLU
Pool:    2×2 max pool → 56×56×128
Conv5:   256 filters, 3×3, ReLU
Conv6:   256 filters, 3×3, ReLU
Pool:    2×2 max pool → 28×28×256
Flatten: 28×28×256 = 200,704
Dense:   4096 → 4096 → 1000
Output:  softmax (1000 classes)
The Pattern
Spatial dimensions decrease: 224 → 112 → 56 → 28 (and, in deeper networks, on to 14 → 7)
Channel depth increases: 3 → 64 → 128 → 256 (and, in deeper networks, on to 512)

The network trades spatial resolution for feature richness. Early layers have large spatial maps with few channels. Deep layers have small spatial maps with many channels. The final dense layers combine all features for classification.
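As a sanity check, the shapes and convolutional parameter counts of the architecture above can be traced in a few lines of Python (bias terms included; the dense layers are omitted):

```python
# Trace spatial size and conv parameters through the classic architecture
def conv_params(k, c_in, c_out):
    # each of the c_out filters has k*k*c_in weights plus one bias
    return (k * k * c_in + 1) * c_out

h = w = 224
convs = [(3, 64), (64, 64), (64, 128), (128, 128), (128, 256), (256, 256)]
total = 0
for i, (c_in, c_out) in enumerate(convs):
    total += conv_params(3, c_in, c_out)
    if i % 2 == 1:              # 2×2 max pool after every second conv
        h, w = h // 2, w // 2
print(f"final feature map: {h}x{w}x256")  # 28x28x256
print(f"flattened size: {h * w * 256}")   # 200704
print(f"conv parameters: {total:,}")      # ~1.1M (the dense layers dominate)
```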
Batch norm + residual connections (from Ch 6) are added to every modern CNN. Each conv block becomes: Conv → BatchNorm → ReLU + skip connection. This is the ResNet pattern that enabled 100+ layer CNNs.
Architecture Evolution
From LeNet to ResNet — 25 years of innovation
LeNet-5 (LeCun, 1998)
  7 layers, 60K params
  Handwritten digits (MNIST)
  First practical CNN

AlexNet (Krizhevsky, 2012)
  8 layers, 60M params
  ImageNet top-5 error: 15.3% (vs 26% previous)
  ReLU, dropout, GPU training

VGGNet (Simonyan, 2014)
  16-19 layers, 138M params
  Uniform 3×3 convolutions
  Proved depth matters

GoogLeNet (Szegedy, 2014)
  22 layers, 6.8M params
  Inception modules (multi-scale)
  Efficient parameter usage

ResNet (He, 2015)
  50-152 layers, 25-60M params
  Skip connections (y = F(x) + x)
  3.57% top-5 error (superhuman!)

EfficientNet (Tan, 2019)
  Compound scaling (depth + width + resolution)
  5.3M params, better accuracy than ResNet
  Found via neural architecture search (NAS)

Vision Transformer (Dosovitskiy, 2020)
  No convolutions at all!
  Splits the image into patches, uses attention
  Matches/beats CNNs with enough data
The trend: ImageNet top-5 error went from 26% (2011) to 3.57% (2015, ResNet) — surpassing human performance (~5%). The key innovations: depth (VGG), multi-scale (Inception), skip connections (ResNet), efficient scaling (EfficientNet), and now attention (ViT).
CNN Applications
Beyond classification — detection, segmentation, generation
# CNN task types
Classification
  Input: image → Output: label
  “This is a cat”

Object Detection (YOLO, Faster R-CNN)
  Input: image → Output: bounding boxes + labels
  “Cat at (x,y,w,h), dog at (x,y,w,h)”

Semantic Segmentation (U-Net, DeepLab)
  Input: image → Output: per-pixel labels
  Every pixel classified (road, car, person)

Instance Segmentation (Mask R-CNN)
  Input: image → Output: per-pixel, per-object masks
  Separate masks for each individual object

Image Generation (GANs, Diffusion)
  Input: noise/text → Output: image
  Create photorealistic images (Ch 11)
Real-World Impact
Medical imaging: Detect tumors, diabetic retinopathy, skin cancer (dermatologist-level accuracy)
Self-driving cars: Detect pedestrians, lanes, signs, other vehicles in real-time
Facial recognition: Unlock phones, security systems, photo organization
Agriculture: Detect crop diseases, count livestock from satellite imagery
Manufacturing: Detect defects on assembly lines
YOLO (You Only Look Once) processes an entire image in a single forward pass, achieving real-time object detection at 30+ FPS. This made practical applications like self-driving cars and security systems possible. YOLOv8 (2023) can detect objects in under 5 ms per image on modern GPUs.
CNNs Today & Key Takeaways
The state of computer vision in the transformer era
CNNs vs Vision Transformers
CNNs still dominate when data is limited, latency matters, or the task is well-defined (medical imaging, edge devices, real-time detection). Vision Transformers (ViT) excel with massive datasets and compute, and are increasingly used in multimodal models (CLIP, DALL-E). Many modern systems combine both — CNN backbones with transformer heads.
The inductive bias advantage: CNNs have built-in assumptions (locality, translation invariance, weight sharing) that make them data-efficient. ViTs must learn these properties from data, requiring much more training data. For small datasets, CNNs still win.
Key Takeaways
1. CNNs exploit spatial structure via local filters and weight sharing

2. Convolution + ReLU + pooling is the core building block

3. Deeper layers learn increasingly abstract features (edges → objects)

4. Spatial dims decrease, channel depth increases through the network

5. ResNet skip connections enabled 100+ layer CNNs

6. Applications: classification, detection, segmentation, generation

7. Vision Transformers are emerging but CNNs remain essential
Coming up: Ch 8 covers RNNs for sequential data (text, time series). Ch 9 introduces the transformer — which replaced RNNs and is now challenging CNNs too.