Remember These
1. Forward: Add noise gradually until pure noise (fixed, no learning)
2. Reverse: Learn to denoise step by step (the neural network’s job)
3. U-Net: Encoder-decoder with skip connections and cross-attention for text
5. Latent: Compress with a VAE first — 8× smaller per spatial dimension, so ~64× fewer positions to denoise (Stable Diffusion)
6. CFG: Amplify text conditioning — a guidance scale around 7.5 is a common sweet spot
6. Sampling: Modern methods need only 4–50 steps (down from 1,000)
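Points 1 and 2 above hinge on the closed-form forward process: you can jump straight to any timestep t without simulating every intermediate step. A minimal NumPy sketch, using the common DDPM-style linear beta schedule (the exact schedule values here are illustrative, not any specific model's):

```python
import numpy as np

def forward_noise(x0, t, alpha_bar):
    """Closed-form forward process: noise x0 directly to timestep t.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    """
    noise = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

# Linear beta schedule over 1,000 steps, then cumulative products.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

x0 = np.random.randn(8, 8)            # toy "image"
xt, eps = forward_noise(x0, 999, alpha_bar)
# By the final step alpha_bar is near zero, so xt is essentially pure noise.
```

The reverse process is the learned half: a network is trained to predict `eps` from `xt` and `t`, which is exactly what the U-Net (or DiT) does at sampling time.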
Why This Matters
Every image and video generation tool you use — Stable Diffusion, DALL-E, Midjourney, Sora, Flux — runs this pipeline. Understanding it helps you:
• Write better prompts (understanding what CFG and cross-attention do)
• Choose the right settings (steps, guidance scale, sampler)
• Debug generation failures (artifacts, wrong content, low quality)
• Evaluate new models (what changed: U-Net vs DiT, noise schedule vs flow matching)
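The CFG knob behind the first two bullets is nothing exotic: at each denoising step the model is run twice (with and without the text prompt) and the two noise predictions are linearly extrapolated. A minimal sketch with toy arrays standing in for the model's outputs:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one. scale=1 recovers the
    plain conditional prediction; larger values amplify the prompt."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy stand-ins for the two noise predictions at one sampling step.
eps_u = np.zeros(4)
eps_c = np.ones(4)

guided = cfg(eps_u, eps_c)  # conditioning direction amplified 7.5x
```

This is why cranking the guidance scale too high causes oversaturated, artifact-ridden images: you are extrapolating far outside the range of predictions the model was trained on.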
Next up: Chapter 6 puts diffusion into practice with text-to-image generation — Stable Diffusion, DALL-E 3, Midjourney, Flux, ControlNet, inpainting, and the creative workflow.