The Architecture
// Stable Diffusion = Latent Diffusion Model
Step 1: VAE Encoder
1024×1024 image → 128×128 latent
(8× downsampling per side → 64× fewer spatial positions)
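The compression arithmetic above can be checked directly. A minimal sketch, assuming an SD-style VAE with 8× downsampling per side (the latent also typically has 4 channels vs. 3 RGB channels, which this spatial-only count ignores):

```python
# Spatial compression of an SD-style VAE (assumed: 8x downsampling per side).
image_side = 1024
downsample = 8

latent_side = image_side // downsample                 # 1024 / 8 = 128
compression = (image_side ** 2) // (latent_side ** 2)  # fewer spatial positions

print(latent_side, compression)  # 128 64
```

So each diffusion step touches 64× fewer spatial positions than pixel-space diffusion would.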
Step 2: CLIP Text Encoder
"a sunset over mountains" → text embeddings
Step 3: Diffusion in Latent Space
Start with noise in 128×128 latent space
Denoise with U-Net conditioned on text embeddings
(typically 20–50 denoising steps with a fast sampler such as DPM-Solver)
Step 4: VAE Decoder
128×128 latent → 1024×1024 image
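The four steps above can be traced as a toy data-flow sketch. The real VAE, CLIP, and U-Net are large neural networks; the stand-in functions here are placeholders (shapes and loop structure only, assuming 4 latent channels and CLIP's 77-token, 768-dim embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins for the real components (not the actual models).
def clip_encode(prompt):
    """Step 2: text -> (77 tokens, 768-dim) embeddings (shapes assumed)."""
    return rng.standard_normal((77, 768))

def unet_predict_noise(latent, t, text_emb):
    """Step 3: the U-Net would predict the noise to remove; toy stand-in."""
    return 0.1 * latent

text_emb = clip_encode("a sunset over mountains")
latent = rng.standard_normal((4, 128, 128))   # Step 3: start from pure noise

for t in range(50, 0, -1):                    # iterative denoising loop
    noise_pred = unet_predict_noise(latent, t, text_emb)
    latent = latent - noise_pred              # simplified update rule

# Step 4 would pass `latent` through the VAE decoder -> 1024x1024 image.
print(latent.shape)  # (4, 128, 128)
```

The point is the data flow: text is encoded once, the latent is refined repeatedly under that conditioning, and decoding happens only at the end.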
Why This Works So Well
• Efficiency: Diffusion operates on 128×128 latents instead of 1024×1024 pixels — 64× fewer spatial positions to process at each denoising step
• Quality: The VAE preserves perceptual quality; the diffusion model handles the creative generation
• Controllability: CLIP text embeddings guide generation through cross-attention in the U-Net
• Accessibility: Runs on consumer GPUs (8GB VRAM) instead of requiring data-center hardware
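The cross-attention mechanism behind the controllability bullet can be sketched in a few lines. A minimal single-head version, assuming a 16×16 U-Net feature map, CLIP's 77 text tokens, and an arbitrary head dimension of 64:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 64                                   # head dimension (assumed)
n_pixels, n_tokens = 16 * 16, 77         # low-res U-Net block, CLIP token count

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_pixels, d))   # queries: flattened image latents
K = rng.standard_normal((n_tokens, d))   # keys: projected text embeddings
V = rng.standard_normal((n_tokens, d))   # values: projected text embeddings

attn = softmax(Q @ K.T / np.sqrt(d))     # each pixel attends over text tokens
out = attn @ V                           # text-informed image features

print(out.shape)  # (256, 64)
```

This is how the prompt steers generation: every spatial position in the U-Net reads from the text embeddings at every denoising step.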
Key insight: Latent Diffusion is one of the most important architectures in generative AI. Understanding its three components — VAE (compression), a text encoder such as CLIP (text understanding), and U-Net (diffusion) — unlocks understanding of Stable Diffusion, DALL-E 3, and most modern image generators, which swap individual pieces (e.g., a transformer in place of the U-Net) but keep the same latent-diffusion recipe.