Ch 8 — Autoencoders & Representation Learning

Undercomplete autoencoders, VAEs, latent spaces, and disentanglement
High Level
Input → Encoder → Latent → Decoder → Reconstruction → Generate
What Is an Autoencoder?
Learning to compress and reconstruct data
The Core Idea
An autoencoder is a neural network trained to copy its input to its output — but with a twist. It must pass through a bottleneck (latent space) that is smaller than the input. The encoder compresses the input into a compact latent representation z, and the decoder reconstructs the original input from z. The network is trained by minimizing reconstruction loss (typically MSE between input and output). The bottleneck forces the network to learn the most important features of the data.
Architecture
// Autoencoder structure
Input x (784 dims, e.g., 28×28 image)
  ↓
Encoder: 784 → 256 → 64
  ↓
Latent z (64 dims)  ← bottleneck
  ↓
Decoder: 64 → 256 → 784
  ↓
Output x̂ (784 dims)

Loss = ||x - x̂||²   // reconstruction error
Key insight: An autoencoder is an unsupervised learning method — it doesn’t need labels. The “label” is the input itself. By learning to reconstruct, it discovers the underlying structure of the data.
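The structure above can be sketched numerically. This is a minimal forward pass with random (untrained) weights, just to make the shapes and the reconstruction loss concrete; in practice the weights are learned by gradient descent, and real encoders use the deeper 784 → 256 → 64 stack shown above.

```python
import numpy as np

# Minimal autoencoder forward pass (untrained, random weights).
# 784 → 64 → 784, following the bottleneck structure above.
rng = np.random.default_rng(0)
x = rng.random(784)                      # flattened 28×28 image

W_enc = rng.normal(0, 0.05, (64, 784))   # encoder weights
W_dec = rng.normal(0, 0.05, (784, 64))   # decoder weights

z = np.tanh(W_enc @ x)                   # latent code z (the bottleneck)
x_hat = W_dec @ z                        # reconstruction x̂

loss = np.mean((x - x_hat) ** 2)         # MSE reconstruction loss
print(z.shape, x_hat.shape, loss)
```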
The Latent Space
A compressed representation of reality
What the Latent Space Captures
The latent space z is a low-dimensional representation of the input. For face images, latent dimensions might correspond to features like pose, lighting, expression, or hair color — even though nobody told the network about these concepts. An undercomplete autoencoder (latent dim < input dim) is forced to learn a compressed representation. An overcomplete autoencoder (latent dim ≥ input dim) can trivially copy the input, so it needs regularization (sparsity, noise) to learn useful features.
Applications
// Autoencoder applications
Dimensionality reduction:
  Like PCA but non-linear
  784-dim images → 32-dim latent
Denoising:
  Train on noisy input → clean output
  Network learns to remove noise
Anomaly detection:
  Train on normal data only
  High reconstruction error = anomaly
Feature learning:
  Use encoder output as features
  for downstream classifiers
Key insight: The latent space is a learned coordinate system for your data. Similar inputs map to nearby points. This makes autoencoders powerful for visualization, clustering, and as feature extractors for other models.
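The anomaly-detection application above reduces to a thresholding rule: score each input by its reconstruction error and flag anything far outside the errors seen on normal data. A hedged sketch, where the error arrays stand in for `||x - decoder(encoder(x))||²` values from a model trained on normal data only:

```python
import numpy as np

def anomaly_flags(errors_normal, errors_new, k=3.0):
    """Flag errors more than k standard deviations above the normal mean."""
    threshold = errors_normal.mean() + k * errors_normal.std()
    return errors_new > threshold

rng = np.random.default_rng(1)
normal = rng.normal(0.05, 0.01, 1000)   # typical errors on normal data
new = np.array([0.05, 0.06, 0.30])      # last input reconstructs poorly

flags = anomaly_flags(normal, new)
print(flags.tolist())                   # → [False, False, True]
```

The choice of `k` trades false positives against missed anomalies; a percentile of the normal-data errors is a common alternative to the mean-plus-k-sigma rule.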
Denoising & Sparse Autoencoders
Regularization techniques for better representations
Denoising Autoencoders (Vincent et al., 2008)
A denoising autoencoder corrupts the input (adding Gaussian noise, masking random pixels) and trains the network to reconstruct the clean original. This prevents the network from learning the identity function and forces it to capture robust features. The corruption acts as regularization, making the learned representations more generalizable. This idea directly inspired BERT’s masked language modeling — corrupting input tokens and predicting the originals.
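Both corruption styles mentioned above are easy to sketch: the network sees the corrupted input, but the loss is computed against the clean original. Masking random entries is the same idea BERT applies to tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)                          # clean input (the target)

x_noisy = x + rng.normal(0, 0.1, x.shape)    # Gaussian corruption
mask = rng.random(x.shape) > 0.25            # keep ~75% of the pixels
x_masked = x * mask                          # masking corruption

# Training pair: (x_noisy or x_masked) as input, clean x as the target.
print(x_masked[~mask].sum())                 # masked entries are zero
```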
Sparse Autoencoders
A sparse autoencoder adds a penalty that encourages most latent units to be inactive (near zero) for any given input. Only a few units “fire” for each input, creating a distributed, sparse code. This is inspired by neuroscience — only a small fraction of neurons in the brain are active at any time. Sparsity is enforced via L1 regularization or a KL divergence penalty on the average activation.
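The two penalties mentioned above can be written down directly. A sketch applied to a batch of latent activations `z` (assumed to be sigmoid outputs in [0, 1]); note how the KL penalty is large when the average activation is far above the target sparsity `rho`:

```python
import numpy as np

def l1_penalty(z, lam=1e-3):
    """L1 sparsity penalty: pushes activations toward zero."""
    return lam * np.abs(z).sum()

def kl_sparsity_penalty(z, rho=0.05, eps=1e-8):
    """KL(rho || rho_hat) per unit, where rho_hat is the mean activation."""
    rho_hat = z.mean(axis=0)                  # average activation per unit
    return np.sum(rho * np.log(rho / (rho_hat + eps))
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat + eps)))

rng = np.random.default_rng(0)
z_dense = rng.uniform(0.4, 0.6, (32, 64))     # most units active
z_sparse = np.where(rng.random((32, 64)) < 0.05,
                    0.9, 0.01)                # few units fire per input

print(kl_sparsity_penalty(z_dense) > kl_sparsity_penalty(z_sparse))
```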
Key insight: Sparse autoencoders have recently gained attention in mechanistic interpretability research. Anthropic and others use sparse autoencoders to decompose LLM activations into interpretable features, helping understand what neural networks have learned.
Variational Autoencoders (VAEs)
Kingma & Welling (2014) — from reconstruction to generation
The Generative Twist
A standard autoencoder maps inputs to specific points in latent space. But what if you want to generate new data? You’d need to sample from the latent space, but the space has no structure — there are gaps and irregular regions. The Variational Autoencoder (VAE) (Kingma & Welling, 2014) solves this by making the encoder output a probability distribution (mean μ and variance σ²) instead of a single point. The latent code z is sampled from this distribution. A KL divergence loss regularizes the distribution to be close to a standard normal N(0, 1), ensuring the latent space is smooth and continuous.
VAE Loss
// VAE loss = reconstruction + regularization
L = E[||x - x̂||²]          // reconstruction
  + KL(q(z|x) || p(z))      // regularization

// Encoder outputs μ and log(σ²)
// Reparameterization trick:
z = μ + σ · ε,  ε ~ N(0, 1)
// Allows backprop through sampling

// KL divergence (closed form for Gaussians):
KL = -0.5 · Σ(1 + log(σ²) - μ² - σ²)
Key insight: The reparameterization trick is the key innovation — it moves the randomness outside the network (ε is sampled, not z), allowing gradients to flow through the encoder. Without this trick, you can’t backpropagate through a sampling operation.
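A numeric sketch of the sampling step and the closed-form Gaussian KL from the loss above; here `mu` and `log_var` are made-up values standing in for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.3, 0.0, 1.2])        # encoder output μ
log_var = np.array([-1.0, -0.5, 0.0, -2.0]) # encoder output log(σ²)
sigma = np.exp(0.5 * log_var)

eps = rng.standard_normal(mu.shape)  # randomness lives outside the network
z = mu + sigma * eps                 # reparameterization trick

# KL(q(z|x) || N(0, I)), closed form, summed over latent dimensions:
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(z.shape, kl)
```

Note that `kl` is zero exactly when μ = 0 and σ² = 1, i.e. when the encoder's distribution already matches the standard normal prior.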
Disentangled Representations
Learning independent factors of variation
What Is Disentanglement?
A disentangled representation is one where each latent dimension corresponds to a single, independent factor of variation. For face images: dimension 1 controls pose, dimension 2 controls lighting, dimension 3 controls expression — and changing one doesn’t affect the others. β-VAE (Higgins et al., 2017) achieves this by increasing the weight of the KL divergence term (β > 1), which pressures each latent dimension to be more independent.
Latent Space Interpolation
// Interpolating in latent space
z_A = encoder(face_A)   // smiling person
z_B = encoder(face_B)   // frowning person

// Linear interpolation
for α in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z = (1-α) · z_A + α · z_B
    image = decoder(z)
// Smooth transition from smile to frown

// Arithmetic in latent space:
// z_smile - z_neutral + z_other_person
//   = other person smiling
Key insight: Smooth interpolation in latent space is the hallmark of a good generative model. If walking between two points produces realistic intermediate images, the model has learned a meaningful representation of the data manifold.
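The interpolation loop above runs as ordinary vector arithmetic. A runnable sketch with stand-in 2-d latent vectors; in a real model `z_A` and `z_B` would come from the trained encoder and each interpolated `z` would be decoded to an image.

```python
import numpy as np

z_A = np.array([1.0, 0.0])      # e.g., encoder(face_A)
z_B = np.array([0.0, 1.0])      # e.g., encoder(face_B)

# Linear interpolation between the two latent codes
path = [(1 - a) * z_A + a * z_B for a in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(path[2])                  # midpoint of the path
```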
Convolutional Autoencoders
Using CNNs for image compression and generation
CNN-Based Architecture
For image data, the encoder uses convolutional layers with stride-2 to progressively downsample, and the decoder uses transposed convolutions (or upsampling + convolution) to upsample back to the original resolution. This preserves spatial structure far better than flattening images into vectors. Modern convolutional VAEs can generate realistic faces, rooms, and objects at resolutions up to 256×256.
Architecture
// Convolutional VAE
Encoder:
  Conv2d(3→32, stride=2)    // 64→32
  Conv2d(32→64, stride=2)   // 32→16
  Conv2d(64→128, stride=2)  // 16→8
  Flatten → Linear → μ, log(σ²)

Decoder:
  Linear → Reshape(128, 8, 8)
  ConvTranspose2d(128→64)   // 8→16
  ConvTranspose2d(64→32)    // 16→32
  ConvTranspose2d(32→3)     // 32→64
Rule of thumb: Transposed convolutions can produce “checkerboard artifacts.” A common fix is to use nearest-neighbor upsampling followed by a regular convolution instead of transposed convolution.
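The spatial sizes annotated in the encoder above follow from the standard convolution output formula. A quick check, assuming kernel size 4 and padding 1 (a common choice that makes stride 2 halve the resolution exactly; the architecture above doesn't specify these):

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output spatial size of a convolution: floor((n + 2p - k)/s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

sizes = [64]
for _ in range(3):              # three stride-2 conv layers
    sizes.append(conv_out(sizes[-1]))
print(sizes)                    # → [64, 32, 16, 8]
```

Running the same arithmetic in reverse gives the 8 → 16 → 32 → 64 upsampling path of the decoder.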
VAE vs. GAN vs. Diffusion
Comparing generative model families
Three Paradigms
VAEs learn an explicit latent space with smooth interpolation but produce blurry outputs (due to the MSE loss averaging over possibilities). GANs (next chapter) produce sharp images but have no encoder and are hard to train. Diffusion models (2020+) learn to denoise, producing the highest-quality images but requiring many sampling steps. Each has trade-offs between sample quality, training stability, latent space structure, and inference speed.
Comparison
// Generative model comparison
            VAE      GAN     Diffusion
Quality:    Medium   High    Highest
Training:   Stable   Hard    Stable
Latent:     Yes      No*     No*
Speed:      Fast     Fast    Slow
Mode cov:   Good     Poor    Good

// * GANs/Diffusion have no explicit encoder
// VAE + GAN hybrids exist (VAE-GAN)
// Latent diffusion (Stable Diffusion) adds
//   a VAE encoder to diffusion models
Key insight: Stable Diffusion (2022) uses a VAE to compress images into a latent space, then runs the diffusion process in that latent space. This combines VAE’s efficient compression with diffusion’s high-quality generation — the best of both worlds.
Summary & What’s Next
From compression to generation
Key Takeaways
Autoencoders learn compressed representations by reconstructing their input through a bottleneck. VAEs add probabilistic structure to the latent space, enabling generation of new data. Disentangled representations separate independent factors of variation. These ideas — latent spaces, encoders, decoders, and the reparameterization trick — are foundational to modern generative AI, from Stable Diffusion’s latent space to the tokenizers used in vision transformers.
The connection: Autoencoders learn to compress; the next chapter covers GANs, which learn to create. While autoencoders minimize reconstruction error, GANs pit a generator against a discriminator in a game-theoretic framework — producing the sharpest images deep learning had ever seen.
Autoencoder Family Tree
// Evolution of autoencoders
2006: Deep autoencoders (Hinton)
2008: Denoising AE (Vincent)
2013: Sparse AE for feature learning
2014: VAE (Kingma & Welling)
2017: β-VAE (disentanglement)
2019: VQ-VAE-2 (discrete latent codes)
2022: Latent Diffusion / Stable Diffusion
      (VAE encoder + diffusion decoder)
2024: Sparse AE for LLM interpretability