The Architecture
// Stable Diffusion = Latent Diffusion Model
Step 1: VAE Encoder
1024×1024 image → 128×128 latent
(8× downsampling per side → 64× fewer spatial positions)
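The compression arithmetic above can be checked directly. A minimal sketch, assuming an SD-style VAE with 8× downsampling per side (the latent also typically has 4 channels vs. 3 RGB channels, which this spatial-only count ignores):

```python
# Spatial compression of an SD-style VAE (assumed: 8x downsampling per side).
image_side = 1024
downsample = 8

latent_side = image_side // downsample                 # 1024 / 8 = 128
compression = (image_side ** 2) // (latent_side ** 2)  # fewer spatial positions

print(latent_side, compression)  # 128 64
```

So each diffusion step touches 64× fewer spatial positions than pixel-space diffusion would.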
Step 2: CLIP Text Encoder
"a sunset over mountains" → text embeddings
Step 3: Diffusion in Latent Space
Start with noise in 128×128 latent space
Denoise with U-Net conditioned on text embeddings
(typically 20–50 denoising steps with a fast sampler such as DPM-Solver)
Step 4: VAE Decoder
128×128 latent → 1024×1024 image
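The four steps above can be traced as a toy data-flow sketch. The real VAE, CLIP, and U-Net are large neural networks; the stand-in functions here are placeholders (shapes and loop structure only, assuming 4 latent channels and CLIP's 77-token, 768-dim embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins for the real components (not the actual models).
def clip_encode(prompt):
    """Step 2: text -> (77 tokens, 768-dim) embeddings (shapes assumed)."""
    return rng.standard_normal((77, 768))

def unet_predict_noise(latent, t, text_emb):
    """Step 3: the U-Net would predict the noise to remove; toy stand-in."""
    return 0.1 * latent

text_emb = clip_encode("a sunset over mountains")
latent = rng.standard_normal((4, 128, 128))   # Step 3: start from pure noise

for t in range(50, 0, -1):                    # iterative denoising loop
    noise_pred = unet_predict_noise(latent, t, text_emb)
    latent = latent - noise_pred              # simplified update rule

# Step 4 would pass `latent` through the VAE decoder -> 1024x1024 image.
print(latent.shape)  # (4, 128, 128)
```

The point is the data flow: text is encoded once, the latent is refined repeatedly under that conditioning, and decoding happens only at the end.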
Why This Works So Well
• Efficiency: Diffusion operates on 128×128 latents instead of 1024×1024 pixels — 64× fewer spatial positions to process at each denoising step
• Quality: The VAE preserves perceptual quality; the diffusion model handles the creative generation
• Controllability: CLIP text embeddings guide generation through cross-attention in the U-Net
• Accessibility: Runs on consumer GPUs (8GB VRAM) instead of requiring data-center hardware
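The cross-attention mechanism behind the controllability bullet can be sketched in a few lines. A minimal single-head version, assuming a 16×16 U-Net feature map, CLIP's 77 text tokens, and an arbitrary head dimension of 64:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64                                   # head dimension (assumed)
n_pixels, n_tokens = 16 * 16, 77         # low-res U-Net block, CLIP token count

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_pixels, d))   # queries: flattened image latents
K = rng.standard_normal((n_tokens, d))   # keys: projected text embeddings
V = rng.standard_normal((n_tokens, d))   # values: projected text embeddings

attn = softmax(Q @ K.T / np.sqrt(d))     # each pixel attends over text tokens
out = attn @ V                           # text-informed image features

print(out.shape)  # (256, 64)
```

This is how the prompt steers generation: every spatial position in the U-Net reads from the text embeddings at every denoising step.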
Key insight: Latent Diffusion is one of the most important architectures in generative AI. Understanding its three components — VAE (compression), a text encoder such as CLIP (text understanding), and U-Net (diffusion) — unlocks understanding of Stable Diffusion, DALL-E 3, and most modern image generators, which swap individual pieces (e.g., a transformer in place of the U-Net) but keep the same latent-diffusion recipe.