Ch 5 — How Diffusion Models Work

Forward noise, reverse denoising, U-Net, latent diffusion, classifier-free guidance — the math made intuitive
High Level

Noise → Denoise → U-Net → Latent → CFG → Generate
The Forward Process
Gradually destroying an image with noise
Adding Noise Step by Step
The forward process takes a clean image and adds Gaussian noise over T steps (typically T=1,000). At step 0, you have the original image. At step T, you have pure random noise. Each step adds a small amount of noise controlled by a noise schedule (β).

Think of it like dissolving a sugar cube in water. At first, you can still see the cube. Gradually, it dissolves until the water is uniformly sweet. The forward process “dissolves” the image into noise.
The Math (Intuitive)
// Forward process: add noise at each step
x_0    = clean image
x_1    = x_0 + small noise (barely visible)
x_2    = x_1 + small noise (slightly fuzzy)
...
x_500  = mostly noise (image barely visible)
...
x_1000 = pure Gaussian noise (image gone)

// Key: this process is FIXED (no learning)
// All the intelligence is in the reverse
Key insight: The forward process requires no learning — it’s just adding noise according to a schedule. All the intelligence is in the reverse process, where the model learns to undo the noise step by step. This separation is what makes diffusion training so stable.
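This fixed schedule has a convenient closed form: because every step adds Gaussian noise, x_t can be sampled directly from x_0 without looping through all t steps. A minimal numpy sketch, assuming the standard DDPM linear β schedule:

```python
import numpy as np

# Closed-form forward process: jump straight from x_0 to x_t.
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (DDPM defaults)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # cumulative signal fraction at step t

def add_noise(x0, t, rng=np.random.default_rng(0)):
    """Sample the noisy image x_t directly from the clean image x0."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones((8, 8))                   # toy "image"
x_early = add_noise(x0, 10)            # mostly signal
x_late = add_noise(x0, 999)            # essentially pure noise
```

This one-shot jump is what makes training efficient: any timestep can be sampled without simulating the chain.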
The Reverse Process
Learning to denoise step by step
The Denoising Network
A neural network (typically a U-Net) is trained to predict the noise added at each step. Given a noisy image at step t, it predicts what noise was added, and we subtract it to get a slightly cleaner image at step t-1. Repeat this T times to go from pure noise to a clean image.
Training
For each training image:
1. Pick a random timestep t (e.g., t=347)
2. Add noise corresponding to step t
3. Ask the network: “What noise was added?”
4. Loss = MSE between predicted noise and actual noise

Simple, stable, no adversarial training. Just predict the noise.
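The four training steps above can be sketched in a few lines of numpy; the `model` here is a zero-returning stand-in for a real U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def model(x_t, t):
    # Placeholder for the U-Net: a real model predicts eps from (x_t, t).
    return np.zeros_like(x_t)

def training_step(x0):
    t = rng.integers(0, T)                           # 1. random timestep
    eps = rng.standard_normal(x0.shape)              # 2. sample the noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_pred = model(x_t, t)                         # 3. "what noise was added?"
    loss = np.mean((eps_pred - eps) ** 2)            # 4. MSE on the noise
    return loss
```

A single regression loss, no discriminator: this is the stability advantage over GANs the text refers to.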
Generation
// Generating an image from scratch
x_T = sample pure Gaussian noise
for t = T, T-1, ..., 1:
    noise_pred = model(x_t, t)            // predict noise
    x_{t-1} = remove_noise(x_t, noise_pred)
x_0 = final clean image. Done!

// Each step: a small, learnable refinement
// Like a sculptor removing material gradually
Key insight: Each denoising step is a small, learnable transformation. The model doesn’t generate the whole image at once — it refines it gradually over many steps. This iterative refinement is why diffusion models produce such high-quality, detailed images.
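The generation loop can be made concrete as well. This sketch follows the standard DDPM update with σ_t = √β_t; `model` is again a placeholder for the trained denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def model(x_t, t):
    return np.zeros_like(x_t)  # stand-in noise prediction

def sample(shape):
    x = rng.standard_normal(shape)                   # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        eps_pred = model(x, t)
        # DDPM posterior mean: subtract the rescaled predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps_pred) / np.sqrt(1.0 - betas[t])
        if t > 0:                                    # add fresh noise except at the end
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x                                         # x_0: the generated sample

img = sample((8, 8))
```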
The U-Net Architecture
The workhorse neural network inside diffusion models
Why U-Net?
The U-Net has an encoder-decoder structure with skip connections, shaped like the letter U:

Encoder (downsampling): Compresses spatial dimensions, increases channels — captures “what” is in the image
Bottleneck: Self-attention at lowest resolution — global context
Decoder (upsampling): Restores spatial dimensions — generates details
Skip connections: Connect encoder to decoder at each level — preserve fine details that would otherwise be lost
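The encoder/decoder shape bookkeeping can be traced with a toy sketch; `np.zeros` placeholders stand in for the conv blocks, and the channel counts are illustrative only:

```python
import numpy as np

# Toy U-Net shape trace: (channels, height, width).
def down(x):
    """Downsample: halve spatial dims, double channels."""
    c, h, w = x.shape
    return np.zeros((c * 2, h // 2, w // 2))

def up(x, skip):
    """Upsample, then concatenate the matching encoder skip connection."""
    c, h, w = x.shape
    upsampled = np.zeros((c // 2, h * 2, w * 2))
    return np.concatenate([upsampled, skip], axis=0)

x = np.zeros((64, 32, 32))       # input features
s1 = x;  d1 = down(x)            # encoder level 1 -> (128, 16, 16)
s2 = d1; d2 = down(d1)           # bottleneck     -> (256, 8, 8)
u1 = up(d2, s2)                  # decoder level 1: 128 up + 128 skip
u2 = up(u1[:128], s1)            # decoder level 2:  64 up +  64 skip
```

The concatenation is the skip connection: fine detail from the encoder rejoins the decoder at matching resolution.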
Cross-Attention: Where Text Meets Image
The U-Net’s cross-attention layers are where CLIP text embeddings are injected. At each resolution level, the model attends to the text embeddings, allowing the text prompt to guide what gets generated. This is the mechanism behind text-to-image.
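A minimal sketch of one cross-attention step: image features supply the queries, text embeddings supply keys and values. Learned projection matrices are omitted; the 77-token length matches CLIP's text encoder, and all values here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
img_tokens = rng.standard_normal((64, d))   # 8x8 spatial positions, flattened
txt_tokens = rng.standard_normal((77, d))   # CLIP-style text embeddings

scores = img_tokens @ txt_tokens.T / np.sqrt(d)                    # (64, 77)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # row softmax
out = weights @ txt_tokens                                         # (64, d)
# Each image position is now a prompt-aware mixture of text embeddings.
```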
The DiT Alternative
Diffusion Transformers (DiT) replace the U-Net with a pure Transformer architecture. Used in Sora, SD3, and Flux. Advantages: better scaling with compute, simpler architecture, stronger global attention. Disadvantage: more compute-intensive per step.
Key insight: The U-Net’s cross-attention layers are where text meets image. CLIP text embeddings are injected here, telling the denoiser what to generate. This is why prompt quality matters so much — the cross-attention mechanism is literally “reading” your prompt at every denoising step.
Latent Diffusion
Running diffusion in compressed space for 10–100x efficiency
The Efficiency Problem
Running diffusion directly on 1024×1024 images requires processing 3.1 million values at each of 50+ denoising steps. That’s enormous compute — impractical for consumer hardware.
The Solution: Compress First
// Latent Diffusion pipeline
1. VAE Encoder (pre-trained, frozen)
   1024×1024 image → 128×128 latent
   64x spatial compression
2. Diffusion in Latent Space
   Add/remove noise on 128×128 latents
   10-100x cheaper per step
3. VAE Decoder (pre-trained, frozen)
   128×128 latent → 1024×1024 image

// This is literally what "Stable Diffusion" is
// Paper: "High-Resolution Image Synthesis
//         with Latent Diffusion Models" (2022)
Why It Works
Perceptual compression: The VAE preserves perceptually important information while discarding imperceptible details
Semantic latent space: Nearby points in latent space correspond to similar images — diffusion operates in a meaningful space
Decoupled training: VAE is trained once; diffusion model is trained separately in the latent space
Consumer-friendly: Runs on 8GB VRAM GPUs instead of requiring data-center hardware
Key insight: Latent diffusion is why Stable Diffusion can run on your laptop. Without VAE compression, you’d need data-center hardware for every image. The VAE handles the “pixel details” while diffusion handles the “creative generation” in compressed space.
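A quick back-of-envelope check on the savings. The 4-channel, 8x-downsampled latent is the Stable Diffusion convention, assumed here:

```python
# Values processed per denoising step, pixel space vs. latent space.
pixel_values = 1024 * 1024 * 3       # RGB pixels
latent_values = 128 * 128 * 4        # 128x128 latent with 4 channels
ratio = pixel_values / latent_values
# ~48x fewer values per step, before counting any other savings
```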
Classifier-Free Guidance (CFG)
The “creativity dial” that controls prompt adherence
The Problem
Diffusion models can generate diverse images, but how do you make them follow a specific text prompt? Without guidance, the model generates random samples from the training distribution. Classifier-Free Guidance (CFG) amplifies the influence of the text condition.
How CFG Works
At each denoising step, run the model twice:
1. Conditional: Denoise with the text prompt
2. Unconditional: Denoise without any prompt (empty text)

Final prediction = unconditional + guidance_scale × (conditional − unconditional)

The guidance scale amplifies the “direction” the text pushes the generation.
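That one-line combination is easy to write down exactly. A sketch, with the model calls that would produce these two predictions omitted:

```python
import numpy as np

def cfg_noise_pred(eps_cond, eps_uncond, guidance_scale=7.5):
    """Extrapolate past the conditional prediction along the text direction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy 2-value "noise predictions" to show the amplification:
eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, -1.0])
guided = cfg_noise_pred(eps_cond, eps_uncond, 7.5)
# At scale 1.0 the result is exactly eps_cond; above 1.0 the text
# direction is amplified, which is what "guidance" means.
```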
Guidance Scale Effects
// CFG scale = the "creativity dial"
scale = 1.0    No guidance (random, diverse)
scale = 3.0    Mild guidance (creative, loose)
scale = 7.5    Default ← sweet spot for most use
scale = 12.0   Strong (very prompt-faithful)
scale = 20.0   Over-saturated, artifacts

// Higher = more prompt adherence, less diversity
// Lower  = more creative, less controlled
// Cost: 2x inference (conditional + unconditional)
Key insight: CFG is the most important user-facing parameter in image generation. When your images look “too generic,” increase CFG. When they look “over-saturated” or have artifacts, decrease it. 7.5 is the sweet spot for most use cases.
Sampling & Speed
From 1,000 steps to real-time generation
The Speed Problem
Original DDPM: 1,000 denoising steps per image. At ~50ms per step on a GPU, that’s 50 seconds per image. Modern sampling techniques dramatically reduce this:
Sampling Methods
// Evolution of sampling speed
DDPM        (2020)   1000 steps   ~50 sec
DDIM        (2021)     50 steps   ~3 sec
DPM-Solver  (2022)     20 steps   ~1.5 sec
LCM         (2023)    4-8 steps   ~0.5 sec
Consistency (2023)    1-4 steps   ~0.2 sec
SDXL Turbo  (2023)      1 step    ~0.1 sec

// 500x speedup in 3 years!
Real-Time Generation
With consistency models and optimized inference:

SDXL Turbo: 1–4 steps, ~0.1 seconds per image
LCM-LoRA: 4 steps, ~0.5 seconds
StreamDiffusion: Real-time at 30+ FPS
FLUX.1-schnell: 4 steps, high quality

We’ve gone from minutes to milliseconds in 3 years. This enables entirely new applications: interactive design, live video effects, and responsive creative tools.
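The jump from 1,000 steps to 50 comes from updates like DDIM's, which move deterministically between arbitrary timesteps instead of walking every one. A sketch of a single DDIM step (the η = 0, fully deterministic variant):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, t, t_prev):
    """Jump deterministically from timestep t to an earlier t_prev."""
    # Recover the model's current best guess of the clean image...
    x0_pred = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    # ...then re-noise it to the target timestep's noise level.
    return np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1 - alpha_bar[t_prev]) * eps_pred
```

Because t_prev can be any earlier step, a 1,000-step schedule can be traversed in 50 jumps (or fewer) with the same trained model.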
Key insight: The speed improvements in diffusion models are as important as the quality improvements. Real-time generation enables interactive applications that were impossible with 50-second generation times. The tradeoff is usually quality vs speed — fewer steps = faster but slightly lower quality.
Noise Schedules & Flow Matching
The details that make diffusion work — and the next evolution
Noise Schedules
The noise schedule (β_t) controls how much noise is added at each step. The choice matters:

Linear: Simple, works okay but wastes steps on very noisy/very clean regions
Cosine: Better distribution of noise levels, used in improved DDPM
Sigmoid: Used in newer models, concentrates steps where they matter most

The schedule affects both training stability and generation quality.
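The difference between schedules is easiest to see through ᾱ_t, the fraction of signal surviving at step t. A sketch comparing linear and cosine, with cosine following the improved-DDPM formulation (s = 0.008):

```python
import numpy as np

T = 1000
t = np.arange(T + 1)

# Linear: beta rises linearly; alpha_bar is the cumulative product.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar_linear = np.cumprod(1.0 - betas)

# Cosine: define alpha_bar directly; smoother at both ends.
s = 0.008
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar_cosine = f / f[0]

# Midway through the process, cosine retains noticeably more signal,
# spending fewer steps in the nearly-pure-noise regime.
```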
Flow Matching: The Next Evolution
Flow Matching (used in Stable Diffusion 3 and Flux) replaces the noise schedule with learned straight-line paths from noise to data. Instead of gradually adding/removing noise, the model learns to transport samples along optimal paths.

Simpler math: No noise schedule to tune
Faster convergence: Straighter paths = fewer steps needed
Better quality: More efficient use of each denoising step
Flexible: Can interpolate between any two distributions
Key insight: Flow Matching is to diffusion what DDIM was to DDPM — a more efficient way to traverse the same generative process. SD3 and Flux use Flow Matching and produce better results with fewer steps. This is the current state of the art.
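The straight-line idea can be written down directly. A rectified-flow-style sketch of the training target; this is the general recipe, not the exact SD3 training code:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))    # data sample
eps = rng.standard_normal((8, 8))   # noise sample
t = rng.uniform()                   # time in [0, 1]

x_t = (1 - t) * x0 + t * eps        # point on the straight data-to-noise path
v_target = eps - x0                 # constant velocity the model regresses

# Sanity check: following v_target from x_t reaches the noise end at t = 1.
x1 = x_t + (1 - t) * v_target
```

No β schedule appears anywhere: the path is defined by simple interpolation, which is the "simpler math" advantage listed above.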
Key Takeaways
The essential diffusion model concepts
Remember These
1. Forward: Add noise gradually until pure noise (fixed, no learning)

2. Reverse: Learn to denoise step by step (the neural network’s job)

3. U-Net: Encoder-decoder with skip connections and cross-attention for text

4. Latent: Compress with VAE first for 64x efficiency (Stable Diffusion)

5. CFG: Amplify text conditioning — guidance scale 7.5 is the sweet spot

6. Sampling: Modern methods need only 4–50 steps (down from 1,000)
Why This Matters
Every image and video generation tool you use — Stable Diffusion, DALL-E, Midjourney, Sora, Flux — runs this pipeline. Understanding it helps you:

Write better prompts (understanding what CFG and cross-attention do)
Choose the right settings (steps, guidance scale, sampler)
Debug generation failures (artifacts, wrong content, low quality)
Evaluate new models (what changed: U-Net vs DiT, noise schedule vs flow matching)
Next up: Chapter 6 puts diffusion into practice with text-to-image generation — Stable Diffusion, DALL-E 3, Midjourney, Flux, ControlNet, inpainting, and the creative workflow.