Ch 3 — The Generative Model Family Tree

VAEs, GANs, Normalizing Flows, Diffusion — how each generates, and why diffusion won
VAEs: Encode, Sample, Decode
The foundation of latent space generation
How VAEs Work
A Variational Autoencoder has two parts:

1. Encoder: Compresses an image into a small latent vector (e.g., 256 dimensions) — not a single point, but a distribution (mean + variance)
2. Decoder: Reconstructs the image from a sample drawn from that distribution

The key innovation: the latent space is regularized to be smooth and continuous, so you can sample random points and decode them into new, coherent images.
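The encode, sample, decode loop above can be sketched in a few lines. This is a toy NumPy version in which random matrices stand in for the trained encoder and decoder; the 784-pixel/256-dimension shapes and all function names are illustrative, not from any real library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for learned networks: random linear maps.
# Shapes are illustrative (784-pixel image, 256-dim latent).
W_enc = rng.normal(0, 0.01, size=(784, 512))   # encoder weights
W_dec = rng.normal(0, 0.01, size=(256, 784))   # decoder weights

def encode(x):
    """Map an image to a distribution over latents (mean + log-variance)."""
    h = np.tanh(x @ W_enc)
    return h[:256], h[256:]                    # mu, logvar

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) differentiably: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Reconstruct the image from a latent sample."""
    return 1 / (1 + np.exp(-(z @ W_dec)))      # sigmoid -> pixels in [0, 1]

x = rng.random(784)                # a fake "image"
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)

# Training minimizes reconstruction error plus KL(N(mu, sigma) || N(0, I)),
# the regularizer that keeps the latent space smooth and sampleable:
recon = np.mean((x - x_hat) ** 2)
kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))
```

The reparameterization step is the "variational" part: sampling through `mu + sigma * eps` instead of sampling directly lets gradients flow back into the encoder.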
Strengths & Weaknesses
Stable training: No adversarial dynamics, just reconstruction + KL divergence loss
Smooth latent space: Interpolation between images works beautifully
Fast generation: Single forward pass through decoder
Weakness: Outputs tend to be blurry because the model optimizes for the average of all possible reconstructions
The Lasting Legacy
VAEs aren’t used directly for image generation anymore, but they’re critical components of modern systems:

Stable Diffusion: Uses a VAE to compress images to latent space before running diffusion (this is the “latent” in Latent Diffusion)
Video models: VAE compression reduces the dimensionality of video frames
Audio models: VAE-like encoders compress audio spectrograms
VQ-VAE: Discrete variant used in DALL-E 1 and audio tokenization
Key insight: The VAE’s lasting contribution isn’t direct image generation — it’s the concept of a learned latent space. This idea makes diffusion models practical by letting them operate in compressed space instead of pixel space.

GANs: The Adversarial Game
Generator vs discriminator — sharp images but unstable training
How GANs Work
Two neural networks compete in a minimax game:

1. Generator (G): Creates fake images from random noise. Its goal: fool the discriminator.
2. Discriminator (D): Tries to distinguish real images from fakes. Its goal: catch the generator.

They train together in an adversarial loop: G gets better at fooling D, and D gets better at detecting fakes. This dynamic produces sharp, realistic images — far sharper than VAEs.
The GAN Era (2014–2021)
DCGAN (2015): First stable GAN with convolutional architecture
StyleGAN (2018): Photorealistic faces with style control at each layer
BigGAN (2018): Class-conditional generation at ImageNet scale
StyleGAN2 (2020): Near-perfect face generation, 1024×1024
Why GANs Lost
Training instability: Mode collapse (generator produces limited variety), oscillation between G and D, failure to converge
No text conditioning: Hard to control what gets generated — you get random samples from the learned distribution
No likelihood: Can’t measure how “likely” a generated image is
Hyperparameter sensitivity: Small changes in learning rate or architecture can cause training to collapse entirely
Key insight: GANs produced the first truly photorealistic AI images and proved that neural networks could generate convincing visual content. But their training instability made them impractical for the text-to-image revolution. Diffusion models inherited GAN-level quality while adding stability and controllability.
Normalizing Flows & Autoregressive Models
Two more branches of the generative family tree
Normalizing Flows
Flows use a series of invertible transformations to map a simple distribution (Gaussian) to a complex one (images). Because every step is invertible, you can compute exact likelihoods — mathematically elegant.

Strengths: Exact likelihood, invertible, theoretically clean
Weaknesses: Computationally expensive, limited expressiveness due to invertibility constraint
Legacy: Flow Matching (used in Stable Diffusion 3 and Flux) is a modern descendant that relaxes the invertibility constraint
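A one-layer affine map is enough to show the change-of-variables trick that gives flows their exact likelihoods. This sketch assumes a standard normal base distribution; the single scalar parameters are purely illustrative stand-ins for a learned invertible network.

```python
import numpy as np

# A one-layer affine flow: x = a * z + b, invertible whenever a != 0.
a, b = 2.0, 1.0

def forward(z):
    """Base sample -> data point."""
    return a * z + b

def inverse(x):
    """Exact inverse, the property flow models are built around."""
    return (x - b) / a

def log_likelihood(x):
    """Change of variables: log p(x) = log p_z(f^-1(x)) - log|det df/dz|."""
    z = inverse(x)
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))   # standard normal base density
    log_det = np.log(abs(a))                     # Jacobian of the affine map
    return log_pz - log_det

x = forward(0.5)             # a sample produced by the flow
ll = log_likelihood(x)       # exact, not a lower bound (unlike a VAE's ELBO)
```

Real flows stack many such invertible layers; the invertibility requirement that makes this exactness possible is also what limits their expressiveness.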
Autoregressive Image Models
Treat image generation like text generation: convert the image into a sequence of discrete tokens (using a VQ-VAE), then predict the next token autoregressively. This is how DALL-E 1 worked.

Strengths: Unified architecture with text, in-context learning, flexible conditioning
Weaknesses: Slow generation (one token at a time), error accumulation
Comeback: Parti (Google), Chameleon (Meta), and some Gemini capabilities use autoregressive generation
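The token-by-token loop can be sketched as follows. The `next_token_logits` function is a stand-in for a trained transformer, and the 16-token codebook and 4×4 grid are toy sizes (real systems use codebooks of thousands of tokens and much larger grids).

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16          # codebook size from a VQ-VAE (tiny for illustration)
GRID = 4            # a 4x4 token grid stands in for an image

def next_token_logits(tokens):
    """Stand-in for a transformer: logits over the codebook given the prefix.
    This toy version just favors repeating the previous token."""
    logits = np.zeros(VOCAB)
    if tokens:
        logits[tokens[-1]] = 2.0
    return logits

def sample(logits, temperature=1.0):
    """Softmax sampling, as in text generation."""
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(rng.choice(VOCAB, p=p))

tokens = []
for _ in range(GRID * GRID):     # one token per grid cell, left to right
    tokens.append(sample(next_token_logits(tokens)))

image_tokens = np.array(tokens).reshape(GRID, GRID)
# The VQ-VAE decoder would now map this token grid back to pixels.
```

The sequential dependence in the loop is both the strength (full conditioning on everything generated so far) and the weakness (one token per forward pass, and early mistakes propagate).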
Key insight: The autoregressive approach is appealing because it unifies text and image generation in one architecture. As models scale, this approach may rival or complement diffusion. Some predict the future is hybrid: autoregressive for structure, diffusion for detail.
Diffusion Models: The Champion
Add noise, then learn to reverse it
The Core Idea
1. Forward process: Gradually add Gaussian noise to an image over T steps (typically T=1000) until it becomes pure random noise
2. Reverse process: Train a neural network to predict and remove the noise at each step
3. Generation: Start from pure noise, apply the learned denoising T times, get a clean image

Conceptually simple, mathematically elegant, and produces stunning results.
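The forward process has a convenient closed form: you can jump straight to any noise level t, which is what makes training efficient. Below is a toy NumPy sketch with an oracle standing in for the trained noise-prediction network; the schedule values are the commonly used linear defaults, but everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # standard linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factor

x0 = rng.random(8)                       # a tiny stand-in "image"

def forward_noise(x0, t):
    """q(x_t | x_0): jump directly to noise level t in closed form."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
    return xt, eps

# The training objective: given (x_t, t), predict eps with a simple MSE loss.
xt, eps = forward_noise(x0, t=500)
eps_pred = eps                           # pretend-perfect oracle for the sketch
loss = np.mean((eps_pred - eps) ** 2)    # exactly 0 for the oracle

# Generation runs the reverse direction: start from pure noise at t = T-1,
# subtract the predicted noise a little at each step, and arrive at a clean
# sample at t = 0.
```

Contrast the loss here with the GAN objective: it is a fixed regression target, which is why training is stable.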
Why Diffusion Won
Stable training: Simple MSE loss (predict the noise), no adversarial dynamics
High quality: Matches or exceeds GAN quality at high resolutions
Diversity: No mode collapse — generates the full distribution
Controllable: Classifier-free guidance (CFG) enables precise text conditioning
Composable: Easy to add conditions: text, edges, depth maps, poses
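The classifier-free guidance mentioned above is one line of arithmetic: run the network twice per step (with and without the prompt) and extrapolate past the conditional prediction. A sketch with made-up noise vectors; the guidance scale of 7.5 is a typical default, not a fixed constant.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction toward the condition.
    guidance_scale = 1 -> no extra push; ~7.5 is a common default."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy noise predictions from the same network, without and with the prompt:
eps_uncond = np.array([0.1, 0.2])
eps_cond = np.array([0.3, 0.1])
guided = cfg(eps_uncond, eps_cond, 7.5)
```

Higher scales follow the prompt more closely at the cost of diversity and, eventually, image quality.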
The Latent Diffusion Trick
Running diffusion on full-resolution images (1024×1024 = 3.1M values) is extremely expensive. Latent Diffusion (the innovation behind Stable Diffusion) first compresses the image to a small latent space using a pre-trained VAE, then runs diffusion in that compressed space.

A 1024×1024 image becomes a 128×128 latent (64x spatial compression). Diffusion in latent space is 10–100x more efficient with minimal quality loss.
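The savings are easy to check. This back-of-the-envelope sketch assumes Stable Diffusion's 8x per-side downsampling and a 4-channel latent (the channel count is an SD-specific detail, not universal):

```python
# Values processed per denoising step, pixel space vs. latent space:
pixels = 1024 * 1024 * 3          # RGB image: 3,145,728 values
latents = 128 * 128 * 4           # latent tensor: 65,536 values
ratio = pixels / latents          # 48x fewer values per step
```

The 64x figure in the text counts spatial positions only (1024² vs. 128²); counting channels too, the network still touches roughly 48x less data at every step.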
Key insight: Diffusion models are the backbone of virtually all modern image and video generation: Stable Diffusion, DALL-E 3, Midjourney, Sora, and Flux all use diffusion (or its variants like Flow Matching) at their core. Chapter 5 dives deep into the mechanics.
Head-to-Head Comparison
Strengths and weaknesses of each approach
Comparison Table
// Generative model comparison
            Quality   Stable   Diverse   Control   Speed
VAE         Low       High     High      Low       Fast
GAN         High      Low      Low       Low       Fast
Flow        Medium    High     High      Medium    Slow
Autoregr.   High      High     High      High      Slow
Diffusion   High      High     High      High      Medium

// Diffusion wins on quality + stability + control
// GANs win on speed (single forward pass)
// Autoregressive wins on unification with text
The Hybrid Future
Modern systems increasingly combine approaches:

VAE + Diffusion: Latent Diffusion (Stable Diffusion) — VAE compresses, diffusion generates
GAN + Diffusion: Adversarial distillation (e.g., SDXL Turbo) uses a GAN-style discriminator to distill diffusion into a few steps
Flow + Diffusion: Flow Matching (Flux, SD3) replaces the noise schedule with learned flow paths
Autoregressive + Diffusion: Some video models use autoregressive for keyframes, diffusion for interpolation
Key insight: The “winner” isn’t pure diffusion — it’s hybrid architectures that cherry-pick the best innovations from the entire family tree. Understanding all four families helps you understand why modern systems work the way they do.
The Historical Arc
From VAEs to modern hybrid systems
Evolution Timeline
2013  VAE (Kingma & Welling)
2014  GAN (Goodfellow et al.)
2015  DCGAN, Normalizing Flows
2018  StyleGAN, BigGAN (peak GAN era)
2020  DDPM (modern diffusion revival)
2021  DALL-E 1 (autoregressive)
2022  Latent Diffusion / Stable Diffusion
2023  SDXL, DALL-E 3, Midjourney v5
2024  Flux, SD3 (Flow Matching)
2025  Consistency models, hybrid approaches
The Pattern
Each generation builds on the previous:

VAEs gave us the concept of learned latent spaces
GANs proved neural networks could generate photorealistic images
Flows showed exact likelihood computation was possible
Diffusion combined stability with quality and controllability
Latent Diffusion married VAE compression with diffusion generation
Flow Matching improved diffusion efficiency with learned transport paths
Key insight: No approach was “wrong” — each contributed essential ideas that live on in modern systems. The VAE’s latent space, the GAN’s sharpness, the flow’s mathematical elegance, and diffusion’s stability all coexist in today’s best models.
Latent Diffusion: The Winning Combination
VAE compression + diffusion generation = Stable Diffusion
The Architecture
// Stable Diffusion = Latent Diffusion Model
Step 1: VAE Encoder
        1024×1024 image → 128×128 latent (64x spatial compression)
Step 2: CLIP Text Encoder
        "a sunset over mountains" → text embeddings
Step 3: Diffusion in Latent Space
        Start with noise in the 128×128 latent space
        Denoise with a U-Net conditioned on the text embeddings
        (e.g., 50 steps with DPM-Solver)
Step 4: VAE Decoder
        128×128 latent → 1024×1024 image
Why This Works So Well
Efficiency: Diffusion operates on 128×128 latents instead of 1024×1024 pixels — 64x less data to process at each step
Quality: The VAE preserves perceptual quality; the diffusion model handles the creative generation
Controllability: CLIP text embeddings guide generation through cross-attention in the U-Net
Accessibility: Runs on consumer GPUs (8GB VRAM) instead of requiring data-center hardware
Key insight: Latent Diffusion is the single most important architecture in generative AI. Understanding its three components — VAE (compression), CLIP (text understanding), and U-Net (diffusion) — unlocks understanding of Stable Diffusion, DALL-E 3, and most modern image generators.
Key Takeaways
What to remember from the generative family tree
The Essential Concepts
1. Latent space (VAE): Compressed representation where generation happens efficiently

2. Adversarial training (GAN): Two networks competing to improve — produces sharp images but unstable

3. Denoising (Diffusion): Learning to remove noise step by step — stable, high-quality, controllable

4. Classifier-free guidance: Amplifying text conditioning to control what gets generated

5. Latent Diffusion: The winning hybrid — VAE compression + diffusion generation
Why This Matters
Every image, video, and audio generation model you’ll encounter uses one or more of these techniques. Understanding the family tree lets you:

• Understand why certain models produce certain artifacts (blurry = VAE-like, sharp but repetitive = GAN-like)
• Predict which approaches will improve and how
• Make informed choices about which tools to use for your specific needs
• Debug generation failures by understanding the underlying mechanism
Next up: Chapter 4 covers CLIP and contrastive learning — the technique that taught AI to connect images and text, enabling the entire text-to-image revolution. Without CLIP, there would be no Stable Diffusion.