Ch 11 — Generative AI

Images, video, audio — from GANs to diffusion models and beyond
High Level
GenAI → GANs → Diffusion → Text2Img → Video → Audio
What Is Generative AI?
Models that create new content — text, images, video, audio, code
The Big Picture
Generative AI creates new content that resembles its training data. Unlike discriminative models (which classify inputs), generative models learn the underlying data distribution and sample from it. Ch 10 covered text generation (LLMs). This chapter covers everything else: images, video, audio, and multimodal generation.
# Generative model families
Autoregressive (GPT, LLMs — Ch 10): generate one token at a time
GANs: generator vs. discriminator; generate an entire image at once
VAEs: encode → latent → decode; learn compressed representations
Diffusion: denoise from pure noise; iteratively refine over many steps
Flow matching: learn transport paths; a modern alternative to diffusion
The Timeline
2014: GANs (Goodfellow)
2014: VAEs (Kingma & Welling)
2017: Transformer (Vaswani)
2020: DDPM (Ho et al.) — diffusion works!
2021: DALL-E (OpenAI) — text-to-image
2022: Stable Diffusion — open source
2022: ChatGPT — text generation
2023: Midjourney v5 — photorealistic
2024: Sora (OpenAI) — text-to-video
2025: Sora 2 — 1080p + audio
The shift: GANs dominated image generation from 2014–2021. Diffusion models overtook them by 2022 with better quality, diversity, and controllability. Now, diffusion + transformers is the dominant paradigm for all visual generation.
GANs: Generative Adversarial Networks
Goodfellow et al. (2014) — the generator-discriminator game
How GANs Work
Two networks compete: the generator creates fake images from random noise, and the discriminator tries to distinguish real from fake. As the discriminator improves, the generator must produce more realistic images to fool it. At equilibrium, the generator produces images indistinguishable from real ones.
# GAN training loop
Generator G: noise z → fake image
Discriminator D: image → real or fake?

# Minimax game:
min_G max_D [E[log D(real)] + E[log(1 − D(G(z)))]]
# D wants to maximize: classify real vs. fake correctly
# G wants to minimize: fool D
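To make the minimax objective concrete, here is a minimal sketch (in NumPy, with made-up example values) of the two losses as they are actually minimized in practice. The non-saturating generator loss shown is the variant Goodfellow et al. recommend for training, rather than the literal minimax form:

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """Binary cross-entropy form of the GAN objective.

    d_real: discriminator outputs D(x) on real images, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated images, each in (0, 1)
    Returns (discriminator_loss, generator_loss), both to be minimized.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # D maximizes E[log D(real)] + E[log(1 - D(fake))]: minimize the negation
    d_loss = -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
    # Non-saturating G loss: maximize E[log D(fake)] instead of minimizing
    # E[log(1 - D(fake))]; same fixed point, but stronger early gradients
    g_loss = -np.mean(np.log(d_fake))
    return d_loss, g_loss
```

At the theoretical equilibrium the discriminator outputs 0.5 everywhere, giving d_loss = 2·log 2 and g_loss = log 2.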
GAN Evolution
GAN (2014): blurry, small images
DCGAN (2015): CNNs, stable training
ProGAN (2018): progressive growing
StyleGAN (2019): photorealistic faces
StyleGAN2 (2020): near-perfect faces
StyleGAN3 (2021): alias-free generation
GAN limitations: Mode collapse — generator produces limited variety. Training instability — delicate balance between G and D. No density estimation — can’t compute P(image). These limitations drove the shift to diffusion models, which are more stable and diverse.
Diffusion Models
Learn to denoise — the dominant paradigm for image generation
The Core Idea
Forward process: Gradually add Gaussian noise to an image over T steps until it becomes pure noise. Reverse process: Train a neural network to predict and remove the noise at each step. At inference, start from pure noise and iteratively denoise to generate a new image.
# Diffusion: two processes

Forward (destroy): image → +noise → +noise → ... → pure noise
x₀ → x₁ → x₂ → ... → x_T (Gaussian noise)
# Fixed, no learning needed

Reverse (create): pure noise → −noise → −noise → ... → image
x_T → x_{T−1} → ... → x₀ (generated image)
# Learned! A neural network predicts the noise

Training: ε-prediction
Add noise to an image, predict what noise was added
L = ||ε − ε_θ(x_t, t)||²
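A convenient property (from the DDPM paper) is that the forward process can be sampled in one shot: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε with ε ~ N(0, I). A minimal NumPy sketch, using the linear beta schedule from Ho et al. (2020); the toy 8×8 "image" is just illustrative:

```python
import numpy as np

def forward_diffuse(x0, t, alphas_cumprod, eps):
    """Sample x_t from q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# Linear beta schedule over T steps, as in the original DDPM paper
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)   # abar_t = prod(1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))     # a toy "image"
eps = rng.standard_normal((8, 8))    # the noise the network must predict
x_t = forward_diffuse(x0, 999, alphas_cumprod, eps)
# By t = 999, abar_t is nearly 0, so x_t is almost pure noise
```

Training then amounts to handing the network (x_t, t) and regressing it onto the very ε that was mixed in, exactly the L = ||ε − ε_θ(x_t, t)||² objective above.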
Why Diffusion Won
vs GANs: More stable training, better diversity (no mode collapse), controllable via text conditioning.

vs VAEs: Much higher image quality, sharper details.

The tradeoff: Diffusion requires many denoising steps (20–1000), making generation slower than GANs. But quality and controllability outweigh speed.
DDPM (Ho et al., 2020) proved diffusion models could match GAN quality. Latent Diffusion (Rombach et al., 2022) moved diffusion to a compressed latent space (via VAE encoder), making it 10–100x faster. This became Stable Diffusion — the open-source model that democratized image generation.
Text-to-Image Generation
DALL-E, Stable Diffusion, Midjourney — from words to pictures
How Text Guides Image Generation
A text encoder (CLIP or T5) converts the prompt into embeddings. These embeddings condition the diffusion model via cross-attention — at each denoising step, the model attends to the text to decide what to generate. Classifier-free guidance (CFG) amplifies the text signal for better prompt adherence.
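The cross-attention step can be sketched in a few lines of NumPy: image latents supply the queries, text embeddings supply the keys and values, so each patch decides which words to "look at". All shapes and weight matrices below are illustrative, not taken from any particular model:

```python
import numpy as np

def cross_attention(x_img, txt, Wq, Wk, Wv):
    """Single-head cross-attention: image latents (queries) attend to
    text embeddings (keys/values).
    x_img: (n_patches, d)   txt: (n_tokens, d_txt)"""
    Q = x_img @ Wq                              # (n_patches, d_head)
    K = txt @ Wk                                # (n_tokens, d_head)
    V = txt @ Wv                                # (n_tokens, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ V                          # each patch mixes in text info

rng = np.random.default_rng(0)
d, d_txt, d_head = 16, 32, 8
out = cross_attention(rng.standard_normal((64, d)),   # 64 latent patches
                      rng.standard_normal((7, d_txt)),  # 7 prompt tokens
                      rng.standard_normal((d, d_head)),
                      rng.standard_normal((d_txt, d_head)),
                      rng.standard_normal((d_txt, d_head)))
```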
# Text-to-image pipeline
1. "A cat astronaut on Mars" → CLIP encoder
2. Text embeddings condition the U-Net/DiT
3. Start from random noise in latent space
4. Denoise for 20–50 steps with text guidance
5. VAE decoder: latent → pixel image

CFG scale (guidance):
1.0 = conditional only, no amplification (weak adherence)
7.5 = balanced (default)
15+ = strong adherence (may over-saturate)
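Classifier-free guidance itself is one line of arithmetic: extrapolate from the unconditional noise prediction toward the text-conditioned one. A minimal sketch with toy 2-vector "predictions":

```python
import numpy as np

def cfg(eps_uncond, eps_text, scale):
    """Classifier-free guidance (Ho & Salimans): push the prediction
    away from the unconditional one, toward the text-conditioned one."""
    return eps_uncond + scale * (eps_text - eps_uncond)

e_u = np.array([0.0, 0.0])   # toy unconditional prediction
e_t = np.array([1.0, -1.0])  # toy text-conditioned prediction
# scale 0 → ignore text; 1 → plain conditional; >1 → amplify the text signal
```

In practice this doubles the cost per step (two forward passes, one with the prompt and one with an empty prompt), which is why guidance scale and step count together dominate inference time.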
Key Models
DALL-E 2 (2022): CLIP + diffusion
Stable Diffusion: latent diffusion, open source — SD 1.5, SDXL, SD 3.0 (DiT-based)
Midjourney: proprietary, aesthetic focus
DALL-E 3 (2023): better text rendering
Imagen (Google): T5 text encoder
FLUX (2024): flow matching + DiT
The architecture shift: Early models used U-Net (CNN-based denoiser). Modern models like SD 3.0 and FLUX use Diffusion Transformers (DiT) — replacing the U-Net with a transformer. This scales better and produces higher-quality images, following the same “transformers win” pattern as NLP.
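The "image as a token sequence" trick behind DiT is just patchification: cut the latent into non-overlapping p×p patches and flatten each into a token. A sketch with an SD-style 64×64×4 latent (patch size here is illustrative):

```python
import numpy as np

def patchify(latent, p):
    """Split a latent image (H, W, C) into non-overlapping p*p patches and
    flatten each into a token, as in Diffusion Transformers (DiT)."""
    H, W, C = latent.shape
    assert H % p == 0 and W % p == 0
    tokens = (latent.reshape(H // p, p, W // p, p, C)
                    .transpose(0, 2, 1, 3, 4)      # group patch pixels together
                    .reshape((H // p) * (W // p), p * p * C))
    return tokens  # (num_tokens, token_dim), ready for a transformer

latent = np.zeros((64, 64, 4))     # e.g. an SD-style 64x64x4 latent
tokens = patchify(latent, p=2)     # 1024 tokens of dimension 16
```

The transformer then processes these tokens with full self-attention, which is what lets DiT scale with compute in the same way language models do.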
Video Generation
From static images to temporal coherence
The Challenge
Video generation must produce temporally coherent frames — objects must move smoothly, physics must be plausible, and style must remain consistent. This is much harder than image generation: a 10-second 30fps video at 1080p has 300 frames, each needing to be consistent with all others.
Sora (OpenAI, 2024): diffusion transformer on spacetime patches
- Up to 60 seconds, 1080p
- Understands 3D space and physics

Sora 2 (2025):
- Synchronized audio generation
- Superior physics realism
- 1080p, 16–20 seconds via API

Other models: Runway Gen-3, Pika, Kling, Veo (Google)
Open source: CogVideo, Open-Sora
How Video Models Work
Most video models extend image diffusion to 3D: the denoiser processes spacetime patches (spatial + temporal). Temporal attention layers ensure frame-to-frame consistency. Some models generate keyframes first, then interpolate. The compute cost is enormous — video generation requires 100–1000x more compute than image generation.
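Spacetime patches are the video analogue of image patchification: the tensor gains a time axis, and each token covers a small block of frames as well as pixels. OpenAI's Sora report describes the idea but not the exact patch sizes, so the numbers below are illustrative:

```python
import numpy as np

def spacetime_patchify(video, pt, ph, pw):
    """Cut a video (T, H, W, C) into pt*ph*pw spacetime patches and
    flatten each into a token; a Sora-style input representation."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group each patch's voxels together
    return x.reshape(-1, pt * ph * pw * C)

video = np.zeros((16, 32, 32, 4))            # 16 latent frames
tokens = spacetime_patchify(video, 2, 4, 4)  # 512 tokens of dimension 128
```

Because every token sees every other token through attention, temporal consistency comes "for free" at the architecture level; the cost is that token count, and thus attention cost, grows with duration and resolution, which is where the 100–1000x compute multiplier comes from.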
The frontier: Video generation is where image generation was in 2021 — impressive but not yet reliable. Challenges include consistent character identity, accurate physics, long-duration coherence, and real-time generation. Rapid progress suggests these will be solved within 1–2 years.
Audio & Music Generation
Text-to-speech, music, and sound effects
Text-to-Speech (TTS): ElevenLabs, OpenAI TTS, Bark
- Clone voices from seconds of audio
- Emotional control, multilingual

Music generation: Suno, Udio — full songs from text; MusicLM (Google), MusicGen (Meta)
- Lyrics, melody, instruments, vocals

Sound effects: AudioGen, Make-An-Audio
- "Thunder during a rainstorm" → audio

Speech-to-speech: GPT-4o — native audio understanding
- Real-time conversation with emotion
Audio Model Architectures
Audio models typically use one of two approaches: codec-based (compress audio to discrete tokens, then use an LLM to generate tokens) or diffusion-based (denoise spectrograms or waveforms). Codec models (like Suno) treat audio generation as a language modeling problem — the same next-token prediction that powers text LLMs.
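To see what "audio as discrete tokens" means, here is a deliberately simple, non-neural tokenizer: mu-law companding plus 8-bit uniform quantization (the classic G.711 telephony scheme). Real neural codecs like EnCodec or SoundStream learn their codebooks, but the interface is the same: waveform in, token sequence out, and an LLM models the tokens:

```python
import numpy as np

def mulaw_encode(x, mu=255, levels=256):
    """Toy audio tokenizer: mu-law companding + uniform quantization."""
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return np.round((y + 1) / 2 * (levels - 1)).astype(int)    # tokens in [0, 255]

def mulaw_decode(tokens, mu=255, levels=256):
    """Invert the quantization and the companding (approximately)."""
    y = tokens / (levels - 1) * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

wave = np.sin(np.linspace(0, 2 * np.pi, 100))   # a toy waveform
tokens = mulaw_encode(wave)    # discrete tokens an LLM could model
recon = mulaw_decode(tokens)   # approximate waveform back
```

The companding step spends more of the 256 levels on quiet samples, where the ear is most sensitive; neural codecs generalize this idea with learned, multi-stage quantizers.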
Voice cloning can now replicate a person’s voice from 3–15 seconds of audio. This has enormous potential for accessibility (restoring lost voices) and entertainment, but also creates risks for fraud and impersonation. Most platforms now require consent verification.
Multimodal & Beyond
Models that understand and generate across modalities
Multimodal understanding:
- GPT-4V/o — text + image + audio input
- Gemini — native multimodal
- Claude — text + image input

Multimodal generation:
- GPT-4o — generates text + images + audio
- Gemini 2.0 — text + image + audio output
- Meta Chameleon — any-to-any generation

3D generation:
- Point-E, Shap-E (OpenAI) — text to 3D
- Gaussian splatting — 3D from images
- NeRF — neural radiance fields

Code generation:
- Cursor, GitHub Copilot, Devin
- LLMs specialized for programming
The Convergence
The trend is clear: models are becoming natively multimodal. Rather than separate models for text, image, and audio, frontier systems process and generate all modalities in a unified architecture. GPT-4o processes text, images, and audio in a single model with shared representations.
The “any-to-any” future: Input any combination of text, image, audio, video, 3D → output any combination. This is where the field is heading. The transformer architecture is flexible enough to handle all modalities through tokenization — everything becomes a sequence of tokens.
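One way to picture "everything becomes a sequence of tokens": give each modality a disjoint id range in one shared vocabulary, then interleave them into a single sequence for one transformer to model. The vocabulary sizes and marker tokens below are entirely made up for illustration; real any-to-any models (e.g. Chameleon) use learned image codebooks for the same purpose:

```python
def unify_tokens(text_tokens, image_tokens, text_vocab=50_000, image_vocab=8_192):
    """Toy 'everything is a token' scheme: shift image-codebook ids past
    the text vocabulary so both modalities share one id space, then
    interleave them into a single flat sequence."""
    BOI = text_vocab + image_vocab   # hypothetical begin-of-image marker
    EOI = BOI + 1                    # hypothetical end-of-image marker
    shifted_image = [t + text_vocab for t in image_tokens]
    return text_tokens + [BOI] + shifted_image + [EOI]

seq = unify_tokens([12, 7, 430], [5, 99, 1001])
# One flat sequence a single autoregressive transformer can model
```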
Impact & Key Takeaways
The creative AI revolution and what comes next
Societal Impact
Creative tools: Artists use AI as a collaborator — concept art, storyboarding, music production, game assets.

Democratization: Anyone can create professional-quality images, music, and video without years of training.

Concerns: Deepfakes, copyright disputes (trained on artists’ work), job displacement in creative industries, misinformation through synthetic media.
Key Takeaways
1. GANs pioneered image generation but suffer from mode collapse and instability

2. Diffusion models denoise from random noise — now the dominant paradigm

3. Latent diffusion (Stable Diffusion) made generation fast and accessible

4. Text conditioning via CLIP/T5 + cross-attention enables text-to-image

5. Video extends diffusion to spacetime patches (Sora)

6. Audio uses codec tokens or spectrogram diffusion

7. The future is natively multimodal — any-to-any generation
Coming up: Ch 12 covers Reinforcement Learning in depth — from Q-learning to AlphaGo to RLHF, the paradigm that teaches agents to act in environments and aligns LLMs with human preferences.