Ch 8 — Autoencoders & Representation Learning

Undercomplete autoencoders, VAEs, latent spaces, and disentanglement
High Level
Input → Encoder → Latent → Decoder → Reconstruction → Generate
What Is an Autoencoder?
Learning to compress and reconstruct data
The Core Idea
An autoencoder is a neural network trained to copy its input to its output — but with a twist. It must pass through a bottleneck (latent space) that is smaller than the input. The encoder compresses the input into a compact latent representation z, and the decoder reconstructs the original input from z. The network is trained by minimizing reconstruction loss (typically MSE between input and output). The bottleneck forces the network to learn the most important features of the data.
Architecture
// Autoencoder structure
Input x (784 dims, e.g., 28×28 image)
  ↓
Encoder: 784 → 256 → 64
  ↓
Latent z (64 dims)  ← bottleneck
  ↓
Decoder: 64 → 256 → 784
  ↓
Output x̂ (784 dims)

Loss = ||x - x̂||²   // reconstruction error
Key insight: An autoencoder is an unsupervised learning method — it doesn’t need labels. The “label” is the input itself. By learning to reconstruct, it discovers the underlying structure of the data.
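The structure above can be sketched numerically. This is a minimal forward pass with random (untrained) weights, just to make the shapes and the reconstruction loss concrete; in practice the weights are learned by gradient descent, and real encoders use the deeper 784 → 256 → 64 stack shown above.

```python
import numpy as np

# Minimal autoencoder forward pass (untrained, random weights).
# 784 → 64 → 784, following the bottleneck structure above.
rng = np.random.default_rng(0)
x = rng.random(784)                      # flattened 28×28 image

W_enc = rng.normal(0, 0.05, (64, 784))   # encoder weights
W_dec = rng.normal(0, 0.05, (784, 64))   # decoder weights

z = np.tanh(W_enc @ x)                   # latent code z (the bottleneck)
x_hat = W_dec @ z                        # reconstruction x̂

loss = np.mean((x - x_hat) ** 2)         # MSE reconstruction loss
print(z.shape, x_hat.shape, loss)
```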
The Latent Space
A compressed representation of reality
What the Latent Space Captures
The latent space z is a low-dimensional representation of the input. For face images, latent dimensions might correspond to features like pose, lighting, expression, or hair color — even though nobody told the network about these concepts. An undercomplete autoencoder (latent dim < input dim) is forced to learn a compressed representation. An overcomplete autoencoder (latent dim ≥ input dim) can trivially copy the input, so it needs regularization (sparsity, noise) to learn useful features.
Applications
// Autoencoder applications
Dimensionality reduction:
  Like PCA but non-linear
  784-dim images → 32-dim latent
Denoising:
  Train on noisy input → clean output
  Network learns to remove noise
Anomaly detection:
  Train on normal data only
  High reconstruction error = anomaly
Feature learning:
  Use encoder output as features
  for downstream classifiers
Key insight: The latent space is a learned coordinate system for your data. Similar inputs map to nearby points. This makes autoencoders powerful for visualization, clustering, and as feature extractors for other models.
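The anomaly-detection application above reduces to a thresholding rule: score each input by its reconstruction error and flag anything far outside the errors seen on normal data. A hedged sketch, where the error arrays stand in for `||x - decoder(encoder(x))||²` values from a model trained on normal data only:

```python
import numpy as np

def anomaly_flags(errors_normal, errors_new, k=3.0):
    """Flag errors more than k standard deviations above the normal mean."""
    threshold = errors_normal.mean() + k * errors_normal.std()
    return errors_new > threshold

rng = np.random.default_rng(1)
normal = rng.normal(0.05, 0.01, 1000)   # typical errors on normal data
new = np.array([0.05, 0.06, 0.30])      # last input reconstructs poorly

flags = anomaly_flags(normal, new)
print(flags.tolist())                   # → [False, False, True]
```

The choice of `k` trades false positives against missed anomalies; a percentile of the normal-data errors is a common alternative to the mean-plus-k-sigma rule.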
Denoising & Sparse Autoencoders
Regularization techniques for better representations
Denoising Autoencoders (Vincent et al., 2008)
A denoising autoencoder corrupts the input (adding Gaussian noise, masking random pixels) and trains the network to reconstruct the clean original. This prevents the network from learning the identity function and forces it to capture robust features. The corruption acts as regularization, making the learned representations more generalizable. This idea directly inspired BERT’s masked language modeling — corrupting input tokens and predicting the originals.
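Both corruption styles mentioned above are easy to sketch: the network sees the corrupted input, but the loss is computed against the clean original. Masking random entries is the same idea BERT applies to tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                          # clean input (the target)

x_noisy = x + rng.normal(0, 0.1, x.shape)    # Gaussian corruption
mask = rng.random(x.shape) > 0.25            # keep ~75% of the pixels
x_masked = x * mask                          # masking corruption

# Training pair: (x_noisy or x_masked) as input, clean x as the target.
print(x_masked[~mask].sum())                 # masked entries are zero
```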
Sparse Autoencoders
A sparse autoencoder adds a penalty that encourages most latent units to be inactive (near zero) for any given input. Only a few units “fire” for each input, creating a distributed, sparse code. This is inspired by neuroscience — only a small fraction of neurons in the brain are active at any time. Sparsity is enforced via L1 regularization or a KL divergence penalty on the average activation.
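The two penalties mentioned above can be written down directly. A sketch applied to a batch of latent activations `z` (assumed to be sigmoid outputs in [0, 1]); note how the KL penalty is large when the average activation is far above the target sparsity `rho`:

```python
import numpy as np

def l1_penalty(z, lam=1e-3):
    """L1 sparsity penalty: pushes activations toward zero."""
    return lam * np.abs(z).sum()

def kl_sparsity_penalty(z, rho=0.05, eps=1e-8):
    """KL(rho || rho_hat) per unit, where rho_hat is the mean activation."""
    rho_hat = z.mean(axis=0)                  # average activation per unit
    return np.sum(rho * np.log(rho / (rho_hat + eps))
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat + eps)))

rng = np.random.default_rng(0)
z_dense = rng.uniform(0.4, 0.6, (32, 64))     # most units active
z_sparse = np.where(rng.random((32, 64)) < 0.05,
                    0.9, 0.01)                # few units fire per input

print(kl_sparsity_penalty(z_dense) > kl_sparsity_penalty(z_sparse))
```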
Key insight: Sparse autoencoders have recently gained attention in mechanistic interpretability research. Anthropic and others use sparse autoencoders to decompose LLM activations into interpretable features, helping understand what neural networks have learned.
Variational Autoencoders (VAEs)
Kingma & Welling (2014) — from reconstruction to generation
The Generative Twist
A standard autoencoder maps inputs to specific points in latent space. But what if you want to generate new data? You’d need to sample from the latent space, but the space has no structure — there are gaps and irregular regions. The Variational Autoencoder (VAE) (Kingma & Welling, 2014) solves this by making the encoder output a probability distribution (mean μ and variance σ²) instead of a single point. The latent code z is sampled from this distribution. A KL divergence loss regularizes the distribution to be close to a standard normal N(0, 1), ensuring the latent space is smooth and continuous.
VAE Loss
// VAE loss = reconstruction + regularization
L = E[||x - x̂||²]          // reconstruction
  + KL(q(z|x) || p(z))      // regularization

// Encoder outputs μ and log(σ²)
// Reparameterization trick:
z = μ + σ · ε,  ε ~ N(0, 1)
// Allows backprop through sampling

// KL divergence (closed form for Gaussians):
KL = -0.5 · Σ(1 + log(σ²) - μ² - σ²)
Key insight: The reparameterization trick is the key innovation — it moves the randomness outside the network (ε is sampled, not z), allowing gradients to flow through the encoder. Without this trick, you can’t backpropagate through a sampling operation.
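A numeric sketch of the sampling step and the closed-form Gaussian KL from the loss above; here `mu` and `log_var` are made-up values standing in for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.3, 0.0, 1.2])        # encoder output μ
log_var = np.array([-1.0, -0.5, 0.0, -2.0]) # encoder output log(σ²)
sigma = np.exp(0.5 * log_var)

eps = rng.standard_normal(mu.shape)  # randomness lives outside the network
z = mu + sigma * eps                 # reparameterization trick

# KL(q(z|x) || N(0, I)), closed form, summed over latent dimensions:
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(z.shape, kl)
```

Note that `kl` is zero exactly when μ = 0 and σ² = 1, i.e. when the encoder's distribution already matches the standard normal prior.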
Disentangled Representations
Learning independent factors of variation
What Is Disentanglement?
A disentangled representation is one where each latent dimension corresponds to a single, independent factor of variation. For face images: dimension 1 controls pose, dimension 2 controls lighting, dimension 3 controls expression — and changing one doesn’t affect the others. β-VAE (Higgins et al., 2017) achieves this by increasing the weight of the KL divergence term (β > 1), which pressures each latent dimension to be more independent.
Latent Space Interpolation
// Interpolating in latent space
z_A = encoder(face_A)   // smiling person
z_B = encoder(face_B)   // frowning person

// Linear interpolation
for α in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z = (1-α) · z_A + α · z_B
    image = decoder(z)
// Smooth transition from smile to frown

// Arithmetic in latent space:
// z_smile - z_neutral + z_other_person
//   = other person smiling
Key insight: Smooth interpolation in latent space is the hallmark of a good generative model. If walking between two points produces realistic intermediate images, the model has learned a meaningful representation of the data manifold.
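The interpolation loop above runs as ordinary vector arithmetic. A runnable sketch with stand-in 2-d latent vectors; in a real model `z_A` and `z_B` would come from the trained encoder and each interpolated `z` would be decoded to an image.

```python
import numpy as np

z_A = np.array([1.0, 0.0])      # e.g., encoder(face_A)
z_B = np.array([0.0, 1.0])      # e.g., encoder(face_B)

# Linear interpolation between the two latent codes
path = [(1 - a) * z_A + a * z_B for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(path[2])                  # midpoint of the path
```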
Convolutional Autoencoders
Using CNNs for image compression and generation
CNN-Based Architecture
For image data, the encoder uses convolutional layers with stride-2 to progressively downsample, and the decoder uses transposed convolutions (or upsampling + convolution) to upsample back to the original resolution. This preserves spatial structure far better than flattening images into vectors. Modern convolutional VAEs can generate realistic faces, rooms, and objects at resolutions up to 256×256.
Architecture
// Convolutional VAE
Encoder:
  Conv2d(3→32, stride=2)    // 64→32
  Conv2d(32→64, stride=2)   // 32→16
  Conv2d(64→128, stride=2)  // 16→8
  Flatten → Linear → μ, log(σ²)

Decoder:
  Linear → Reshape(128, 8, 8)
  ConvTranspose2d(128→64)   // 8→16
  ConvTranspose2d(64→32)    // 16→32
  ConvTranspose2d(32→3)     // 32→64
Rule of thumb: Transposed convolutions can produce “checkerboard artifacts.” A common fix is to use nearest-neighbor upsampling followed by a regular convolution instead of transposed convolution.
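The spatial sizes annotated in the encoder above follow from the standard convolution output formula. A quick check, assuming kernel size 4 and padding 1 (a common choice that makes stride 2 halve the resolution exactly; the architecture above doesn't specify these):

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output spatial size of a convolution: floor((n + 2p - k)/s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

sizes = [64]
for _ in range(3):              # three stride-2 conv layers
    sizes.append(conv_out(sizes[-1]))
print(sizes)                    # → [64, 32, 16, 8]
```

Running the same arithmetic in reverse gives the 8 → 16 → 32 → 64 upsampling path of the decoder.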
VAE vs. GAN vs. Diffusion
Comparing generative model families
Three Paradigms
VAEs learn an explicit latent space with smooth interpolation but produce blurry outputs (due to the MSE loss averaging over possibilities). GANs (next chapter) produce sharp images but have no encoder and are hard to train. Diffusion models (2020+) learn to denoise, producing the highest-quality images but requiring many sampling steps. Each has trade-offs between sample quality, training stability, latent space structure, and inference speed.
Comparison
// Generative model comparison
            VAE      GAN     Diffusion
Quality:    Medium   High    Highest
Training:   Stable   Hard    Stable
Latent:     Yes      No*     No*
Speed:      Fast     Fast    Slow
Mode cov:   Good     Poor    Good

// * GANs/Diffusion have no explicit encoder
// VAE + GAN hybrids exist (VAE-GAN)
// Latent diffusion (Stable Diffusion) adds
//   a VAE encoder to diffusion models
Key insight: Stable Diffusion (2022) uses a VAE to compress images into a latent space, then runs the diffusion process in that latent space. This combines VAE’s efficient compression with diffusion’s high-quality generation — the best of both worlds.
Summary & What’s Next
From compression to generation
Key Takeaways
Autoencoders learn compressed representations by reconstructing their input through a bottleneck. VAEs add probabilistic structure to the latent space, enabling generation of new data. Disentangled representations separate independent factors of variation. These ideas — latent spaces, encoders, decoders, and the reparameterization trick — are foundational to modern generative AI, from Stable Diffusion’s latent space to the tokenizers used in vision transformers.
The connection: Autoencoders learn to compress; the next chapter covers GANs, which learn to create. While autoencoders minimize reconstruction error, GANs pit a generator against a discriminator in a game-theoretic framework — producing the sharpest images deep learning had ever seen.
Autoencoder Family Tree
// Evolution of autoencoders
2006: Deep autoencoders (Hinton)
2008: Denoising AE (Vincent)
2013: Sparse AE for feature learning
2014: VAE (Kingma & Welling)
2017: β-VAE (disentanglement)
2019: VQ-VAE-2 (discrete latent codes)
2022: Latent Diffusion / Stable Diffusion
      (VAE encoder + diffusion decoder)
2024: Sparse AE for LLM interpretability