Ch 5 — How Diffusion Models Work

Forward noise, reverse denoising, U-Net, latent diffusion, classifier-free guidance — the math made intuitive
High Level

Noise → Denoise → U-Net → Latent → CFG → Generate
The Forward Process
Gradually destroying an image with noise
Adding Noise Step by Step
The forward process takes a clean image and adds Gaussian noise over T steps (typically T=1,000). At step 0, you have the original image. At step T, you have pure random noise. Each step adds a small amount of noise controlled by a noise schedule (β).

Think of it like dissolving a sugar cube in water. At first, you can still see the cube. Gradually, it dissolves until the water is uniformly sweet. The forward process “dissolves” the image into noise.
The Math (Intuitive)
// Forward process: add noise at each step
x_0    = clean image
x_1    = x_0 + small noise (barely visible)
x_2    = x_1 + small noise (slightly fuzzy)
...
x_500  = mostly noise (image barely visible)
...
x_1000 = pure Gaussian noise (image gone)

// Key: this process is FIXED (no learning)
// All the intelligence is in the reverse
Key insight: The forward process requires no learning — it’s just adding noise according to a schedule. All the intelligence is in the reverse process, where the model learns to undo the noise step by step. This separation is what makes diffusion training so stable.
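This fixed schedule has a convenient closed form: because every step adds Gaussian noise, x_t can be sampled directly from x_0 without looping through all t steps. A minimal numpy sketch, assuming the standard DDPM linear β schedule:

```python
import numpy as np

# Closed-form forward process: jump straight from x_0 to x_t.
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (DDPM defaults)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # cumulative signal fraction at step t

def add_noise(x0, t, rng=np.random.default_rng(0)):
    """Sample the noisy image x_t directly from the clean image x0."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones((8, 8))                   # toy "image"
x_early = add_noise(x0, 10)            # mostly signal
x_late = add_noise(x0, 999)            # essentially pure noise
```

This one-shot jump is what makes training efficient: any timestep can be sampled without simulating the chain.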
The Reverse Process
Learning to denoise step by step
The Denoising Network
A neural network (typically a U-Net) is trained to predict the noise added at each step. Given a noisy image at step t, it predicts what noise was added, and we subtract it to get a slightly cleaner image at step t-1. Repeat this T times to go from pure noise to a clean image.
Training
For each training image:
1. Pick a random timestep t (e.g., t=347)
2. Add noise corresponding to step t
3. Ask the network: “What noise was added?”
4. Loss = MSE between predicted noise and actual noise

Simple, stable, no adversarial training. Just predict the noise.
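The four training steps above can be sketched in a few lines of numpy; the `model` here is a zero-returning stand-in for a real U-Net:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def model(x_t, t):
    # Placeholder for the U-Net: a real model predicts eps from (x_t, t).
    return np.zeros_like(x_t)

def training_step(x0):
    t = rng.integers(0, T)                           # 1. random timestep
    eps = rng.standard_normal(x0.shape)              # 2. sample the noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    eps_pred = model(x_t, t)                         # 3. "what noise was added?"
    loss = np.mean((eps_pred - eps) ** 2)            # 4. MSE on the noise
    return loss
```

A single regression loss, no discriminator: this is the stability advantage over GANs the text refers to.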
Generation
// Generating an image from scratch
x_T = sample pure Gaussian noise
for t = T, T-1, ..., 1:
    noise_pred = model(x_t, t)            // predict noise
    x_{t-1} = remove_noise(x_t, noise_pred)
x_0 = final clean image. Done!

// Each step: a small, learnable refinement
// Like a sculptor removing material gradually
Key insight: Each denoising step is a small, learnable transformation. The model doesn’t generate the whole image at once — it refines it gradually over many steps. This iterative refinement is why diffusion models produce such high-quality, detailed images.
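The generation loop can be made concrete as well. This sketch follows the standard DDPM update with σ_t = √β_t; `model` is again a placeholder for the trained denoiser:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def model(x_t, t):
    return np.zeros_like(x_t)  # stand-in noise prediction

def sample(shape):
    x = rng.standard_normal(shape)                   # x_T: pure Gaussian noise
    for t in reversed(range(T)):
        eps_pred = model(x, t)
        # DDPM posterior mean: subtract the rescaled predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps_pred) / np.sqrt(1.0 - betas[t])
        if t > 0:                                    # add fresh noise except at the end
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x                                         # x_0: the generated sample

img = sample((8, 8))
```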
The U-Net Architecture
The workhorse neural network inside diffusion models
Why U-Net?
The U-Net has an encoder-decoder structure with skip connections, shaped like the letter U:

Encoder (downsampling): Compresses spatial dimensions, increases channels — captures “what” is in the image
Bottleneck: Self-attention at lowest resolution — global context
Decoder (upsampling): Restores spatial dimensions — generates details
Skip connections: Connect encoder to decoder at each level — preserve fine details that would otherwise be lost
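The encoder/decoder shape bookkeeping can be traced with a toy sketch; `np.zeros` placeholders stand in for the conv blocks, and the channel counts are illustrative only:

```python
import numpy as np

# Toy U-Net shape trace: (channels, height, width).
def down(x):
    """Downsample: halve spatial dims, double channels."""
    c, h, w = x.shape
    return np.zeros((c * 2, h // 2, w // 2))

def up(x, skip):
    """Upsample, then concatenate the matching encoder skip connection."""
    c, h, w = x.shape
    upsampled = np.zeros((c // 2, h * 2, w * 2))
    return np.concatenate([upsampled, skip], axis=0)

x = np.zeros((64, 32, 32))       # input features
s1 = x;  d1 = down(x)            # encoder level 1 -> (128, 16, 16)
s2 = d1; d2 = down(d1)           # bottleneck     -> (256, 8, 8)
u1 = up(d2, s2)                  # decoder level 1: 128 up + 128 skip
u2 = up(u1[:128], s1)            # decoder level 2:  64 up +  64 skip
```

The concatenation is the skip connection: fine detail from the encoder rejoins the decoder at matching resolution.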
Cross-Attention: Where Text Meets Image
The U-Net’s cross-attention layers are where CLIP text embeddings are injected. At each resolution level, the model attends to the text embeddings, allowing the text prompt to guide what gets generated. This is the mechanism behind text-to-image.
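A minimal sketch of one cross-attention step: image features supply the queries, text embeddings supply keys and values. Learned projection matrices are omitted; the 77-token length matches CLIP's text encoder, and all values here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
img_tokens = rng.standard_normal((64, d))   # 8x8 spatial positions, flattened
txt_tokens = rng.standard_normal((77, d))   # CLIP-style text embeddings

scores = img_tokens @ txt_tokens.T / np.sqrt(d)                    # (64, 77)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # row softmax
out = weights @ txt_tokens                                         # (64, d)
# Each image position is now a prompt-aware mixture of text embeddings.
```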
The DiT Alternative
Diffusion Transformers (DiT) replace the U-Net with a pure Transformer architecture. Used in Sora, SD3, and Flux. Advantages: better scaling with compute, simpler architecture, stronger global attention. Disadvantage: more compute-intensive per step.
Key insight: The U-Net’s cross-attention layers are where text meets image. CLIP text embeddings are injected here, telling the denoiser what to generate. This is why prompt quality matters so much — the cross-attention mechanism is literally “reading” your prompt at every denoising step.
Latent Diffusion
Running diffusion in compressed space for 10–100x efficiency
The Efficiency Problem
Running diffusion directly on 1024×1024 images requires processing 3.1 million values at each of 50+ denoising steps. That’s enormous compute — impractical for consumer hardware.
The Solution: Compress First
// Latent Diffusion pipeline
1. VAE Encoder (pre-trained, frozen)
   1024×1024 image → 128×128 latent
   64x spatial compression
2. Diffusion in Latent Space
   Add/remove noise on 128×128 latents
   10-100x cheaper per step
3. VAE Decoder (pre-trained, frozen)
   128×128 latent → 1024×1024 image

// This is literally what "Stable Diffusion" is
// Paper: "High-Resolution Image Synthesis
//         with Latent Diffusion Models" (2022)
Why It Works
Perceptual compression: The VAE preserves perceptually important information while discarding imperceptible details
Semantic latent space: Nearby points in latent space correspond to similar images — diffusion operates in a meaningful space
Decoupled training: VAE is trained once; diffusion model is trained separately in the latent space
Consumer-friendly: Runs on 8GB VRAM GPUs instead of requiring data-center hardware
Key insight: Latent diffusion is why Stable Diffusion can run on your laptop. Without VAE compression, you’d need data-center hardware for every image. The VAE handles the “pixel details” while diffusion handles the “creative generation” in compressed space.
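A quick back-of-envelope check on the savings. The 4-channel, 8x-downsampled latent is the Stable Diffusion convention, assumed here:

```python
# Values processed per denoising step, pixel space vs. latent space.
pixel_values = 1024 * 1024 * 3       # RGB pixels
latent_values = 128 * 128 * 4        # 128x128 latent with 4 channels
ratio = pixel_values / latent_values
# ~48x fewer values per step, before counting any other savings
```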
Classifier-Free Guidance (CFG)
The “creativity dial” that controls prompt adherence
The Problem
Diffusion models can generate diverse images, but how do you make them follow a specific text prompt? Without guidance, the model generates random samples from the training distribution. Classifier-Free Guidance (CFG) amplifies the influence of the text condition.
How CFG Works
At each denoising step, run the model twice:
1. Conditional: Denoise with the text prompt
2. Unconditional: Denoise without any prompt (empty text)

Final prediction = unconditional + guidance_scale × (conditional − unconditional)

The guidance scale amplifies the “direction” the text pushes the generation.
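That one-line combination is easy to write down exactly. A sketch, with the model calls that would produce these two predictions omitted:

```python
import numpy as np

def cfg_noise_pred(eps_cond, eps_uncond, guidance_scale=7.5):
    """Extrapolate past the conditional prediction along the text direction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy 2-value "noise predictions" to show the amplification:
eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, -1.0])
guided = cfg_noise_pred(eps_cond, eps_uncond, 7.5)
# At scale 1.0 the result is exactly eps_cond; above 1.0 the text
# direction is amplified, which is what "guidance" means.
```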
Guidance Scale Effects
// CFG scale = the "creativity dial"
scale = 1.0    No guidance (random, diverse)
scale = 3.0    Mild guidance (creative, loose)
scale = 7.5    Default ← sweet spot for most use
scale = 12.0   Strong (very prompt-faithful)
scale = 20.0   Over-saturated, artifacts

// Higher = more prompt adherence, less diversity
// Lower  = more creative, less controlled
// Cost: 2x inference (conditional + unconditional)
Key insight: CFG is the most important user-facing parameter in image generation. When your images look “too generic,” increase CFG. When they look “over-saturated” or have artifacts, decrease it. 7.5 is the sweet spot for most use cases.
Sampling & Speed
From 1,000 steps to real-time generation
The Speed Problem
Original DDPM: 1,000 denoising steps per image. At ~50ms per step on a GPU, that’s 50 seconds per image. Modern sampling techniques dramatically reduce this:
Sampling Methods
// Evolution of sampling speed
DDPM        (2020)   1000 steps   ~50 sec
DDIM        (2021)     50 steps   ~3 sec
DPM-Solver  (2022)     20 steps   ~1.5 sec
LCM         (2023)    4-8 steps   ~0.5 sec
Consistency (2023)    1-4 steps   ~0.2 sec
SDXL Turbo  (2023)      1 step    ~0.1 sec

// 500x speedup in 3 years!
Real-Time Generation
With consistency models and optimized inference:

SDXL Turbo: 1–4 steps, ~0.1 seconds per image
LCM-LoRA: 4 steps, ~0.5 seconds
StreamDiffusion: Real-time at 30+ FPS
FLUX.1-schnell: 4 steps, high quality

We’ve gone from minutes to milliseconds in 3 years. This enables entirely new applications: interactive design, live video effects, and responsive creative tools.
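The jump from 1,000 steps to 50 comes from updates like DDIM's, which move deterministically between arbitrary timesteps instead of walking every one. A sketch of a single DDIM step (the η = 0, fully deterministic variant):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def ddim_step(x_t, eps_pred, t, t_prev):
    """Jump deterministically from timestep t to an earlier t_prev."""
    # Recover the model's current best guess of the clean image...
    x0_pred = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    # ...then re-noise it to the target timestep's noise level.
    return np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1 - alpha_bar[t_prev]) * eps_pred
```

Because t_prev can be any earlier step, a 1,000-step schedule can be traversed in 50 jumps (or fewer) with the same trained model.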
Key insight: The speed improvements in diffusion models are as important as the quality improvements. Real-time generation enables interactive applications that were impossible with 50-second generation times. The tradeoff is usually quality vs speed — fewer steps = faster but slightly lower quality.
Noise Schedules & Flow Matching
The details that make diffusion work — and the next evolution
Noise Schedules
The noise schedule (β_t) controls how much noise is added at each step. The choice matters:

Linear: Simple, works okay but wastes steps on very noisy/very clean regions
Cosine: Better distribution of noise levels, used in improved DDPM
Sigmoid: Used in newer models, concentrates steps where they matter most

The schedule affects both training stability and generation quality.
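The difference between schedules is easiest to see through ᾱ_t, the fraction of signal surviving at step t. A sketch comparing linear and cosine, with cosine following the improved-DDPM formulation (s = 0.008):

```python
import numpy as np

T = 1000
t = np.arange(T + 1)

# Linear: beta rises linearly; alpha_bar is the cumulative product.
betas = np.linspace(1e-4, 0.02, T)
alpha_bar_linear = np.cumprod(1.0 - betas)

# Cosine: define alpha_bar directly; smoother at both ends.
s = 0.008
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
alpha_bar_cosine = f / f[0]

# Midway through the process, cosine retains noticeably more signal,
# spending fewer steps in the nearly-pure-noise regime.
```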
Flow Matching: The Next Evolution
Flow Matching (used in Stable Diffusion 3 and Flux) replaces the noise schedule with learned straight-line paths from noise to data. Instead of gradually adding/removing noise, the model learns to transport samples along optimal paths.

Simpler math: No noise schedule to tune
Faster convergence: Straighter paths = fewer steps needed
Better quality: More efficient use of each denoising step
Flexible: Can interpolate between any two distributions
Key insight: Flow Matching is to diffusion what DDIM was to DDPM — a more efficient way to traverse the same generative process. SD3 and Flux use Flow Matching and produce better results with fewer steps. This is the current state of the art.
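The straight-line idea can be written down directly. A rectified-flow-style sketch of the training target; this is the general recipe, not the exact SD3 training code:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))    # data sample
eps = rng.standard_normal((8, 8))   # noise sample
t = rng.uniform()                   # time in [0, 1]

x_t = (1 - t) * x0 + t * eps        # point on the straight data-to-noise path
v_target = eps - x0                 # constant velocity the model regresses

# Sanity check: following v_target from x_t reaches the noise end at t = 1.
x1 = x_t + (1 - t) * v_target
```

No β schedule appears anywhere: the path is defined by simple interpolation, which is the "simpler math" advantage listed above.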
Key Takeaways
The essential diffusion model concepts
Remember These
1. Forward: Add noise gradually until pure noise (fixed, no learning)

2. Reverse: Learn to denoise step by step (the neural network’s job)

3. U-Net: Encoder-decoder with skip connections and cross-attention for text

4. Latent: Compress with VAE first for 64x efficiency (Stable Diffusion)

5. CFG: Amplify text conditioning — guidance scale 7.5 is the sweet spot

6. Sampling: Modern methods need only 4–50 steps (down from 1,000)
Why This Matters
Every image and video generation tool you use — Stable Diffusion, DALL-E, Midjourney, Sora, Flux — runs this pipeline. Understanding it helps you:

Write better prompts (understanding what CFG and cross-attention do)
Choose the right settings (steps, guidance scale, sampler)
Debug generation failures (artifacts, wrong content, low quality)
Evaluate new models (what changed: U-Net vs DiT, noise schedule vs flow matching)
Next up: Chapter 6 puts diffusion into practice with text-to-image generation — Stable Diffusion, DALL-E 3, Midjourney, Flux, ControlNet, inpainting, and the creative workflow.