4
CNNs exploit spatial structure by sharing weights across the image through sliding filters.
- Convolution: A small learnable filter slides across the input, detecting local patterns (edges, textures) regardless of their position in the image.
- Pooling: Downsamples feature maps to reduce computation and provide approximate translation invariance: small shifts in the input barely change the output.
- Hierarchical Features: Early layers detect edges, middle layers detect textures and parts, deep layers detect whole objects.
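The sliding-filter and pooling mechanics above can be sketched in pure Python (real CNNs use optimized library ops; the image and edge-detector kernel here are illustrative):

```python
# Minimal 2D convolution (valid padding, stride 1) and 2x2 max pooling.
def conv2d(image, kernel):
    """Slide `kernel` over `image`, taking a dot product at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

def max_pool2x2(fmap):
    """Downsample by taking the max of each non-overlapping 2x2 window."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# A vertical-edge kernel responds wherever intensity changes left-to-right,
# no matter where in the image that change occurs (weight sharing).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_kernel = [[-1, 1],
               [-1, 1]]
fmap = conv2d(image, edge_kernel)   # 3x3 feature map, strong response at the edge
pooled = max_pool2x2(fmap)          # the edge survives pooling
```

The same kernel fires at every edge position, which is exactly the weight sharing the summary line describes.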
5
The evolution from AlexNet to EfficientNet shows how architectural innovations drove the deep learning revolution.
- AlexNet (2012): Proved deep CNNs + GPUs could crush traditional computer vision. The “ImageNet moment” that launched the deep learning era.
- ResNet (2015): Skip connections solved the degradation problem, enabling networks with 100+ layers. The most influential architectural innovation in deep learning.
- Transfer Learning: Pre-train on ImageNet, fine-tune on your task. This is the standard workflow — training from scratch is rarely necessary.
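The skip-connection idea behind ResNet fits in a few lines; a minimal sketch with an illustrative (not trained) transform:

```python
# A residual block computes F(x) + x. If the learned transform F contributes
# nothing, the block falls back to the identity, so extra depth cannot make
# the network worse -- the fix for the degradation problem.
def residual_block(x, transform):
    """Apply `transform` (the learned F) and add the skip connection."""
    return [xi + fi for xi, fi in zip(x, transform(x))]

# Zero transform: the block is exactly the identity.
identity_out = residual_block([1.0, 2.0], lambda v: [0.0 for _ in v])
# Nonzero transform: the input still flows through unchanged underneath.
out = residual_block([1.0, 2.0], lambda v: [0.5 * vi for vi in v])
```

Because the identity path is additive, gradients also flow straight through it during backpropagation, which is what makes 100+ layer training stable.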
6
RNNs process sequences by maintaining a hidden state that acts as memory of past inputs.
- Hidden State: A vector that gets updated at each time step, carrying information from previous inputs. This gives RNNs a form of memory.
- Vanishing Gradients in Time: During backpropagation through time (BPTT), gradients shrink exponentially with sequence length, making it hard to learn long-range dependencies.
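The recurrence is simple enough to write out directly; a scalar-hidden-state sketch with illustrative weights:

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0, b=0.0):
    """One recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t + b)."""
    return math.tanh(w_h * h + w_x * x + b)

def run_rnn(inputs, h0=0.0):
    """Fold the whole sequence into a single hidden state (the 'memory')."""
    h = h0
    for x in inputs:
        h = rnn_step(h, x)
    return h

h = run_rnn([1.0, 0.0, -1.0])
```

The vanishing-gradient problem is visible in the same formula: BPTT multiplies one factor of w_h * (1 - h_t**2) per time step, and when that factor is below 1 in magnitude the product shrinks exponentially with sequence length.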
7
Gating mechanisms largely solved the vanishing gradient problem for sequences by learning what to remember and what to forget.
- LSTM Gates: Forget gate (what to discard), input gate (what to store), output gate (what to expose). This selective memory is what makes LSTMs work.
- GRU: A simplified LSTM with only two gates (reset and update). Often performs comparably with fewer parameters.
- Bidirectional RNNs: Process the sequence in both directions to capture both past and future context at each position.
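The three LSTM gates map directly onto code; a scalar toy cell with illustrative weights (real cells use weight matrices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(c, h, x, w):
    """One LSTM step on scalars. `w` holds the (toy) gate weights."""
    f = sigmoid(w["f_h"] * h + w["f_x"] * x)          # forget gate: what to discard
    i = sigmoid(w["i_h"] * h + w["i_x"] * x)          # input gate: what to store
    o = sigmoid(w["o_h"] * h + w["o_x"] * x)          # output gate: what to expose
    c_tilde = math.tanh(w["c_h"] * h + w["c_x"] * x)  # candidate new memory
    c = f * c + i * c_tilde   # cell state update is additive, so gradients
    h = o * math.tanh(c)      # can flow through it largely undiminished
    return c, h

w = {k: 1.0 for k in ("f_h", "f_x", "i_h", "i_x", "o_h", "o_x", "c_h", "c_x")}
c, h = lstm_step(0.0, 0.0, 1.0, w)
```

The additive `f * c + i * c_tilde` update is the key: unlike the vanilla RNN's repeated tanh squashing, the cell state gives gradients a near-linear path through time. A GRU merges the forget/input pair into a single update gate, which is why it gets by with two gates.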
8
Autoencoders learn compressed representations by training to reconstruct their own input through a bottleneck.
- Bottleneck: The narrow middle layer forces the network to learn the most important features of the data, discarding noise.
- Variational Autoencoders (VAEs): Encode inputs as probability distributions rather than fixed points, enabling smooth interpolation and generation of new data.
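The bottleneck idea can be sketched without any training: the encoder/decoder weights below are hand-picked (not learned) for data lying along one direction, to show that a 1-number code can preserve exactly the structure the data actually has:

```python
def encode(x, w=(1.0, 2.0, 3.0)):
    """Bottleneck: project a 3-D input onto a single direction -> one number."""
    norm_sq = sum(wi * wi for wi in w)
    return sum(xi * wi for xi, wi in zip(x, w)) / norm_sq

def decode(code, w=(1.0, 2.0, 3.0)):
    """Reconstruct the full 3-D input from the 1-number code."""
    return [code * wi for wi in w]

x = [2.0, 4.0, 6.0]   # lies on the (1, 2, 3) direction, like the training data
code = encode(x)      # the compressed representation
x_hat = decode(code)  # reconstruction
err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
```

Inputs off that direction reconstruct imperfectly, which is the point: the bottleneck keeps the dominant structure and discards the rest as noise. Training an autoencoder amounts to learning which direction(s) to keep.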
9
GANs learn by pitting two networks against each other: a generator that creates fakes and a discriminator that detects them.
- Adversarial Training: The generator improves by fooling the discriminator; the discriminator improves by catching fakes. Both get better through competition.
- Mode Collapse: The generator learns to produce only a few outputs that fool the discriminator, losing diversity. A persistent challenge in GAN training.
- StyleGAN: Introduced style-based generation with progressive growing, producing photorealistic faces at 1024×1024 resolution.
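The adversarial loop fits in a toy scalar example (everything here is a deliberate simplification: the "real data" is a single number, the generator outputs one constant, and the discriminator is a 1-D logistic unit with hand-derived gradient updates):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

real = 3.0        # the (single-point) real data distribution
g = 0.0           # generator's parameter: its current fake sample
a, b = 1.0, 0.0   # discriminator D(x) = sigmoid(a * x + b)
lr = 0.05

for _ in range(500):
    d_real, d_fake = sigmoid(a * real + b), sigmoid(a * g + b)
    # Discriminator step: ascend log D(real) + log(1 - D(fake)),
    # i.e. get better at calling real "real" and fake "fake".
    a += lr * ((1 - d_real) * real - d_fake * g)
    b += lr * ((1 - d_real) - d_fake)
    # Generator step: ascend log D(fake), i.e. move g toward
    # whatever the discriminator currently scores as "real".
    d_fake = sigmoid(a * g + b)
    g += lr * (1 - d_fake) * a
```

After training, g sits near 3.0: the competition pushed the generator onto the real data. Note this toy generator emits one point by construction, which is harmless here but is exactly what mode collapse looks like when the real distribution is diverse.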
The Bottom Line: Each architecture was designed for a specific data type: CNNs for spatial data (images), RNNs/LSTMs for sequential data (text, audio), autoencoders for compression, GANs for generation. The Transformer eventually unified them all.