A Modern CNN
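```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extractor: 3x3 convs with padding=1 keep height/width,
        # and each MaxPool2d(2) halves them.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),    # RGB input -> 32 channels
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            # Global average pooling: any HxW -> 1x1, so input size is flexible
            nn.AdaptiveAvgPool2d(1),
        )
        # Map the 128-dim pooled feature vector to 10 class scores (logits)
        self.classifier = nn.Linear(128, 10)

    def forward(self, x):
        x = self.features(x)       # (N, 128, 1, 1)
        x = torch.flatten(x, 1)    # (N, 128)
        return self.classifier(x)  # (N, 10)
```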
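As a sanity check on the layer sizes in SimpleCNN, the sketch below traces the spatial size and parameter count by hand in plain Python. It assumes 32x32 RGB inputs (CIFAR-10-sized; the text does not state an input size) and uses the standard output-size formula for convolution and pooling, floor((H + 2p - k) / s) + 1. The helper names `conv_out`, `conv_params`, and `trace_simple_cnn` are hypothetical, introduced only for this illustration.

```python
# Hand-trace of SimpleCNN shapes and parameter counts.
# Assumption: 32x32 RGB input (CIFAR-10-sized); helper names are hypothetical.

def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a conv or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_params(in_ch, out_ch, kernel):
    """Weights (out_ch * in_ch * k * k) plus one bias per output channel."""
    return out_ch * in_ch * kernel * kernel + out_ch

def trace_simple_cnn(size=32):
    shapes, params = [], 0
    channels = 3
    # (out_channels, followed_by_maxpool) for the three 3x3, padding=1 convs
    for out_ch, pool in [(32, True), (64, True), (128, False)]:
        size = conv_out(size, kernel=3, padding=1)  # 'same' padding keeps size
        params += conv_params(channels, out_ch, 3)
        channels = out_ch
        if pool:
            size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2) halves it
        shapes.append((channels, size, size))
    # AdaptiveAvgPool2d(1) collapses whatever HxW remains to 1x1
    shapes.append((channels, 1, 1))
    params += channels * 10 + 10  # nn.Linear(128, 10): weights + biases
    return shapes, params

shapes, params = trace_simple_cnn()
print(shapes)  # [(32, 16, 16), (64, 8, 8), (128, 8, 8), (128, 1, 1)]
print(params)  # 94538
```

Because of the global average pooling, a larger input such as 64x64 only changes the last feature map (to 128x16x16); the pooled vector is still 128-dimensional, so the classifier and the parameter count stay the same.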
What's Next
This chapter covered the mechanics of CNNs: convolutions, pooling, stride, padding, and feature hierarchies. The next chapter explores the landmark CNN architectures (AlexNet, VGG, GoogLeNet, and ResNet) that drove ImageNet error rates down sharply, eventually below published estimates of human top-5 error, and defined the modern era of computer vision.
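The mechanics recapped above can be made concrete with a small from-scratch sketch. It handles a single channel and a square kernel only and, like `nn.Conv2d`, computes cross-correlation (the kernel is not flipped). The function names `pad2d`, `conv2d`, and `max_pool2d` are hypothetical, and the code is meant for illustration, not speed.

```python
# Naive single-channel 2D convolution (cross-correlation) with stride and
# zero padding, plus non-overlapping max pooling. Illustration only.

def pad2d(img, p):
    """Zero-pad a 2D list by p on every side."""
    w = len(img[0]) + 2 * p
    top = [[0] * w for _ in range(p)]
    body = [[0] * p + list(row) + [0] * p for row in img]
    bottom = [[0] * w for _ in range(p)]
    return top + body + bottom

def conv2d(img, kernel, stride=1, padding=0):
    """Slide a k x k kernel over the padded image, stepping by `stride`."""
    x = pad2d(img, padding)
    k = len(kernel)
    out_h = (len(x) - k) // stride + 1
    out_w = (len(x[0]) - k) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the window
            row.append(sum(x[r + u][c + v] * kernel[u][v]
                           for u in range(k) for v in range(k)))
        out.append(row)
    return out

def max_pool2d(img, size=2):
    """Max over non-overlapping size x size windows (stride == size)."""
    return [[max(img[r + u][c + v] for u in range(size) for v in range(size))
             for c in range(0, len(img[0]) - size + 1, size)]
            for r in range(0, len(img) - size + 1, size)]

img = [[1, 2, 3, 0],
       [4, 5, 6, 1],
       [7, 8, 9, 2],
       [0, 1, 2, 3]]
box = [[1] * 3 for _ in range(3)]  # 3x3 box filter: sums each window

print(conv2d(img, box))                        # [[45, 36], [42, 37]]
print(conv2d(img, box, stride=2, padding=1))   # [[12, 17], [25, 37]]
print(max_pool2d(img))                         # [[5, 6], [8, 9]]
```

Without padding, the 3x3 kernel shrinks the 4x4 input to 2x2. With padding=1 and stride=1 the output would stay 4x4, and stride=2 then halves that to 2x2, matching the size formula used by `nn.Conv2d`.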
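The broader connection: CNNs build in three priors about images. These are locality (nearby pixels are related), translation equivariance (a pattern is detected the same way wherever it appears), and compositionality (complex features are assembled from simpler ones). The same principles carry over to other domains, such as 1D convolutions for audio and graph convolutions for molecules. Attention, which eventually superseded convolutions for many tasks, relaxes the strict locality prior by letting every position interact with every other.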
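The translation-equivariance prior can be checked numerically: shifting the input and then convolving gives the same result as convolving and then shifting the output. The 1D sketch below, which also stands in for the audio case, assumes circular (wrap-around) padding so that the identity holds exactly; with zero padding it holds only away from the edges. The helper names `circular_conv1d` and `shift` are hypothetical.

```python
# Translation equivariance of convolution, checked on a 1D signal.
# Assumption: circular padding, so the property holds exactly at every index.

def circular_conv1d(x, kernel):
    """Cross-correlation with wrap-around indexing; output length = len(x)."""
    n, k = len(x), len(kernel)
    half = k // 2  # center the kernel (odd kernel length assumed)
    return [sum(kernel[j] * x[(i + j - half) % n] for j in range(k))
            for i in range(n)]

def shift(x, s):
    """Circularly shift a list right by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)

signal = [0, 0, 1, 3, 1, 0, 0, 0]  # a small "bump", e.g. an audio transient
kernel = [-1, 0, 1]                # simple derivative-like detector

print(circular_conv1d(signal, kernel))  # [0, 1, 3, 0, -3, -1, 0, 0]

a = shift(circular_conv1d(signal, kernel), 2)  # convolve, then shift
b = circular_conv1d(shift(signal, 2), kernel)  # shift, then convolve
print(a == b)  # True
```

The detector's response moves with the bump instead of changing shape, which is exactly the property that lets one set of shared weights recognize a pattern anywhere in the input.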