Ch 11 — Generative AI

Images, video, audio — from GANs to diffusion models and beyond
High Level
GenAI → GANs → Diffusion → Text2Img → Video → Audio
What Is Generative AI?
Models that create new content — text, images, video, audio, code
The Big Picture
Generative AI creates new content that resembles its training data. Unlike discriminative models (which classify inputs), generative models learn the underlying data distribution and sample from it. Ch 10 covered text generation (LLMs). This chapter covers everything else: images, video, audio, and multimodal generation.
# Generative model families
Autoregressive (GPT, LLMs — Ch 10): generate one token at a time
GANs: generator vs. discriminator; generate an entire image at once
VAEs: encode → latent → decode; learn compressed representations
Diffusion: denoise from pure noise; iteratively refine over many steps
Flow matching: learn transport paths; a modern alternative to diffusion
The Timeline
2014: GANs (Goodfellow)
2014: VAEs (Kingma & Welling)
2017: Transformer (Vaswani)
2020: DDPM (Ho et al.) — diffusion works!
2021: DALL-E (OpenAI) — text-to-image
2022: Stable Diffusion — open source
2022: ChatGPT — text generation
2023: Midjourney v5 — photorealistic
2024: Sora (OpenAI) — text-to-video
2025: Sora 2 — 1080p + audio
The shift: GANs dominated image generation from 2014–2021. Diffusion models overtook them by 2022 with better quality, diversity, and controllability. Now, diffusion + transformers is the dominant paradigm for all visual generation.
GANs: Generative Adversarial Networks
Goodfellow et al. (2014) — the generator-discriminator game
How GANs Work
Two networks compete: the generator creates fake images from random noise, and the discriminator tries to distinguish real from fake. As the discriminator improves, the generator must produce more realistic images to fool it. At equilibrium, the generator produces images indistinguishable from real ones.
# GAN training loop
Generator G: noise z → fake image
Discriminator D: image → real or fake?

# Minimax game:
min_G max_D [E[log D(real)] + E[log(1 − D(G(z)))]]
# D wants to maximize: classify real vs. fake correctly
# G wants to minimize: fool D
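To make the minimax objective concrete, here is a minimal sketch (in NumPy, with made-up example values) of the two losses as they are actually minimized in practice. The non-saturating generator loss shown is the variant Goodfellow et al. recommend for training, rather than the literal minimax form:

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """Binary cross-entropy form of the GAN objective.

    d_real: discriminator outputs D(x) on real images, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated images, each in (0, 1)
    Returns (discriminator_loss, generator_loss), both to be minimized.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    # D maximizes E[log D(real)] + E[log(1 - D(fake))]: minimize the negation
    d_loss = -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
    # Non-saturating G loss: maximize E[log D(fake)] instead of minimizing
    # E[log(1 - D(fake))]; same fixed point, but stronger early gradients
    g_loss = -np.mean(np.log(d_fake))
    return d_loss, g_loss
```

At the theoretical equilibrium the discriminator outputs 0.5 everywhere, giving d_loss = 2·log 2 and g_loss = log 2.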
GAN Evolution
GAN (2014): blurry, small images
DCGAN (2015): CNNs, stable training
ProGAN (2018): progressive growing
StyleGAN (2019): photorealistic faces
StyleGAN2 (2020): near-perfect faces
StyleGAN3 (2021): alias-free generation
GAN limitations: Mode collapse — generator produces limited variety. Training instability — delicate balance between G and D. No density estimation — can’t compute P(image). These limitations drove the shift to diffusion models, which are more stable and diverse.
Diffusion Models
Learn to denoise — the dominant paradigm for image generation
The Core Idea
Forward process: Gradually add Gaussian noise to an image over T steps until it becomes pure noise. Reverse process: Train a neural network to predict and remove the noise at each step. At inference, start from pure noise and iteratively denoise to generate a new image.
# Diffusion: two processes

Forward (destroy): image → +noise → +noise → ... → pure noise
x₀ → x₁ → x₂ → ... → x_T (Gaussian noise)
# Fixed, no learning needed

Reverse (create): pure noise → −noise → −noise → ... → image
x_T → x_{T−1} → ... → x₀ (generated image)
# Learned! A neural network predicts the noise

Training: ε-prediction
Add noise to an image, predict what noise was added
L = ||ε − ε_θ(x_t, t)||²
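A convenient property (from the DDPM paper) is that the forward process can be sampled in one shot: x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε with ε ~ N(0, I). A minimal NumPy sketch, using the linear beta schedule from Ho et al. (2020); the toy 8×8 "image" is just illustrative:

```python
import numpy as np

def forward_diffuse(x0, t, alphas_cumprod, eps):
    """Sample x_t from q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# Linear beta schedule over T steps, as in the original DDPM paper
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)   # abar_t = prod(1 - beta_s)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))     # a toy "image"
eps = rng.standard_normal((8, 8))    # the noise the network must predict
x_t = forward_diffuse(x0, 999, alphas_cumprod, eps)
# By t = 999, abar_t is nearly 0, so x_t is almost pure noise
```

Training then amounts to handing the network (x_t, t) and regressing it onto the very ε that was mixed in, exactly the L = ||ε − ε_θ(x_t, t)||² objective above.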
Why Diffusion Won
vs GANs: More stable training, better diversity (no mode collapse), controllable via text conditioning.

vs VAEs: Much higher image quality, sharper details.

The tradeoff: Diffusion requires many denoising steps (20–1000), making generation slower than GANs. But quality and controllability outweigh speed.
DDPM (Ho et al., 2020) proved diffusion models could match GAN quality. Latent Diffusion (Rombach et al., 2022) moved diffusion to a compressed latent space (via VAE encoder), making it 10–100x faster. This became Stable Diffusion — the open-source model that democratized image generation.
Text-to-Image Generation
DALL-E, Stable Diffusion, Midjourney — from words to pictures
How Text Guides Image Generation
A text encoder (CLIP or T5) converts the prompt into embeddings. These embeddings condition the diffusion model via cross-attention — at each denoising step, the model attends to the text to decide what to generate. Classifier-free guidance (CFG) amplifies the text signal for better prompt adherence.
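The cross-attention step can be sketched in a few lines of NumPy: image latents supply the queries, text embeddings supply the keys and values, so each patch decides which words to "look at". All shapes and weight matrices below are illustrative, not taken from any particular model:

```python
import numpy as np

def cross_attention(x_img, txt, Wq, Wk, Wv):
    """Single-head cross-attention: image latents (queries) attend to
    text embeddings (keys/values).
    x_img: (n_patches, d)   txt: (n_tokens, d_txt)"""
    Q = x_img @ Wq                              # (n_patches, d_head)
    K = txt @ Wk                                # (n_tokens, d_head)
    V = txt @ Wv                                # (n_tokens, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return weights @ V                          # each patch mixes in text info

rng = np.random.default_rng(0)
d, d_txt, d_head = 16, 32, 8
out = cross_attention(rng.standard_normal((64, d)),   # 64 latent patches
                      rng.standard_normal((7, d_txt)),  # 7 prompt tokens
                      rng.standard_normal((d, d_head)),
                      rng.standard_normal((d_txt, d_head)),
                      rng.standard_normal((d_txt, d_head)))
```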
# Text-to-image pipeline
1. "A cat astronaut on Mars" → CLIP encoder
2. Text embeddings condition the U-Net/DiT
3. Start from random noise in latent space
4. Denoise for 20–50 steps with text guidance
5. VAE decoder: latent → pixel image

CFG scale (guidance):
1.0 = conditional only, no amplification (weak adherence)
7.5 = balanced (default)
15+ = strong adherence (may over-saturate)
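Classifier-free guidance itself is one line of arithmetic: extrapolate from the unconditional noise prediction toward the text-conditioned one. A minimal sketch with toy 2-vector "predictions":

```python
import numpy as np

def cfg(eps_uncond, eps_text, scale):
    """Classifier-free guidance (Ho & Salimans): push the prediction
    away from the unconditional one, toward the text-conditioned one."""
    return eps_uncond + scale * (eps_text - eps_uncond)

e_u = np.array([0.0, 0.0])   # toy unconditional prediction
e_t = np.array([1.0, -1.0])  # toy text-conditioned prediction
# scale 0 → ignore text; 1 → plain conditional; >1 → amplify the text signal
```

In practice this doubles the cost per step (two forward passes, one with the prompt and one with an empty prompt), which is why guidance scale and step count together dominate inference time.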
Key Models
DALL-E 2 (2022): CLIP + diffusion
Stable Diffusion: latent diffusion, open source — SD 1.5, SDXL, SD 3.0 (DiT-based)
Midjourney: proprietary, aesthetic focus
DALL-E 3 (2023): better text rendering
Imagen (Google): T5 text encoder
FLUX (2024): flow matching + DiT
The architecture shift: Early models used U-Net (CNN-based denoiser). Modern models like SD 3.0 and FLUX use Diffusion Transformers (DiT) — replacing the U-Net with a transformer. This scales better and produces higher-quality images, following the same “transformers win” pattern as NLP.
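The "image as a token sequence" trick behind DiT is just patchification: cut the latent into non-overlapping p×p patches and flatten each into a token. A sketch with an SD-style 64×64×4 latent (patch size here is illustrative):

```python
import numpy as np

def patchify(latent, p):
    """Split a latent image (H, W, C) into non-overlapping p*p patches and
    flatten each into a token, as in Diffusion Transformers (DiT)."""
    H, W, C = latent.shape
    assert H % p == 0 and W % p == 0
    tokens = (latent.reshape(H // p, p, W // p, p, C)
                    .transpose(0, 2, 1, 3, 4)      # group patch pixels together
                    .reshape((H // p) * (W // p), p * p * C))
    return tokens  # (num_tokens, token_dim), ready for a transformer

latent = np.zeros((64, 64, 4))     # e.g. an SD-style 64x64x4 latent
tokens = patchify(latent, p=2)     # 1024 tokens of dimension 16
```

The transformer then processes these tokens with full self-attention, which is what lets DiT scale with compute in the same way language models do.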
Video Generation
From static images to temporal coherence
The Challenge
Video generation must produce temporally coherent frames — objects must move smoothly, physics must be plausible, and style must remain consistent. This is much harder than image generation: a 10-second 30fps video at 1080p has 300 frames, each needing to be consistent with all others.
Sora (OpenAI, 2024): diffusion transformer on spacetime patches
- Up to 60 seconds, 1080p
- Understands 3D space and physics

Sora 2 (2025):
- Synchronized audio generation
- Superior physics realism
- 1080p, 16–20 seconds via API

Other models: Runway Gen-3, Pika, Kling, Veo (Google)
Open source: CogVideo, Open-Sora
How Video Models Work
Most video models extend image diffusion to 3D: the denoiser processes spacetime patches (spatial + temporal). Temporal attention layers ensure frame-to-frame consistency. Some models generate keyframes first, then interpolate. The compute cost is enormous — video generation requires 100–1000x more compute than image generation.
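Spacetime patches are the video analogue of image patchification: the tensor gains a time axis, and each token covers a small block of frames as well as pixels. OpenAI's Sora report describes the idea but not the exact patch sizes, so the numbers below are illustrative:

```python
import numpy as np

def spacetime_patchify(video, pt, ph, pw):
    """Cut a video (T, H, W, C) into pt*ph*pw spacetime patches and
    flatten each into a token; a Sora-style input representation."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # group each patch's voxels together
    return x.reshape(-1, pt * ph * pw * C)

video = np.zeros((16, 32, 32, 4))            # 16 latent frames
tokens = spacetime_patchify(video, 2, 4, 4)  # 512 tokens of dimension 128
```

Because every token sees every other token through attention, temporal consistency comes "for free" at the architecture level; the cost is that token count, and thus attention cost, grows with duration and resolution, which is where the 100–1000x compute multiplier comes from.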
The frontier: Video generation is where image generation was in 2021 — impressive but not yet reliable. Challenges include consistent character identity, accurate physics, long-duration coherence, and real-time generation. Rapid progress suggests these will be solved within 1–2 years.
Audio & Music Generation
Text-to-speech, music, and sound effects
Text-to-Speech (TTS): ElevenLabs, OpenAI TTS, Bark
- Clone voices from seconds of audio
- Emotional control, multilingual

Music generation: Suno, Udio — full songs from text; MusicLM (Google), MusicGen (Meta)
- Lyrics, melody, instruments, vocals

Sound effects: AudioGen, Make-An-Audio
- "Thunder during a rainstorm" → audio

Speech-to-speech: GPT-4o — native audio understanding
- Real-time conversation with emotion
Audio Model Architectures
Audio models typically use one of two approaches: codec-based (compress audio to discrete tokens, then use an LLM to generate tokens) or diffusion-based (denoise spectrograms or waveforms). Codec models (like Suno) treat audio generation as a language modeling problem — the same next-token prediction that powers text LLMs.
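To see what "audio as discrete tokens" means, here is a deliberately simple, non-neural tokenizer: mu-law companding plus 8-bit uniform quantization (the classic G.711 telephony scheme). Real neural codecs like EnCodec or SoundStream learn their codebooks, but the interface is the same: waveform in, token sequence out, and an LLM models the tokens:

```python
import numpy as np

def mulaw_encode(x, mu=255, levels=256):
    """Toy audio tokenizer: mu-law companding + uniform quantization."""
    x = np.clip(x, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return np.round((y + 1) / 2 * (levels - 1)).astype(int)    # tokens in [0, 255]

def mulaw_decode(tokens, mu=255, levels=256):
    """Invert the quantization and the companding (approximately)."""
    y = tokens / (levels - 1) * 2 - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

wave = np.sin(np.linspace(0, 2 * np.pi, 100))   # a toy waveform
tokens = mulaw_encode(wave)    # discrete tokens an LLM could model
recon = mulaw_decode(tokens)   # approximate waveform back
```

The companding step spends more of the 256 levels on quiet samples, where the ear is most sensitive; neural codecs generalize this idea with learned, multi-stage quantizers.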
Voice cloning can now replicate a person’s voice from 3–15 seconds of audio. This has enormous potential for accessibility (restoring lost voices) and entertainment, but also creates risks for fraud and impersonation. Most platforms now require consent verification.
Multimodal & Beyond
Models that understand and generate across modalities
Multimodal understanding:
- GPT-4V/o — text + image + audio input
- Gemini — native multimodal
- Claude — text + image input

Multimodal generation:
- GPT-4o — generates text + images + audio
- Gemini 2.0 — text + image + audio output
- Meta Chameleon — any-to-any generation

3D generation:
- Point-E, Shap-E (OpenAI) — text to 3D
- Gaussian splatting — 3D from images
- NeRF — neural radiance fields

Code generation:
- Cursor, GitHub Copilot, Devin
- LLMs specialized for programming
The Convergence
The trend is clear: models are becoming natively multimodal. Rather than separate models for text, image, and audio, frontier systems process and generate all modalities in a unified architecture. GPT-4o processes text, images, and audio in a single model with shared representations.
The “any-to-any” future: Input any combination of text, image, audio, video, 3D → output any combination. This is where the field is heading. The transformer architecture is flexible enough to handle all modalities through tokenization — everything becomes a sequence of tokens.
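One way to picture "everything becomes a sequence of tokens": give each modality a disjoint id range in one shared vocabulary, then interleave them into a single sequence for one transformer to model. The vocabulary sizes and marker tokens below are entirely made up for illustration; real any-to-any models (e.g. Chameleon) use learned image codebooks for the same purpose:

```python
def unify_tokens(text_tokens, image_tokens, text_vocab=50_000, image_vocab=8_192):
    """Toy 'everything is a token' scheme: shift image-codebook ids past
    the text vocabulary so both modalities share one id space, then
    interleave them into a single flat sequence."""
    BOI = text_vocab + image_vocab   # hypothetical begin-of-image marker
    EOI = BOI + 1                    # hypothetical end-of-image marker
    shifted_image = [t + text_vocab for t in image_tokens]
    return text_tokens + [BOI] + shifted_image + [EOI]

seq = unify_tokens([12, 7, 430], [5, 99, 1001])
# One flat sequence a single autoregressive transformer can model
```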
Impact & Key Takeaways
The creative AI revolution and what comes next
Societal Impact
Creative tools: Artists use AI as a collaborator — concept art, storyboarding, music production, game assets.

Democratization: Anyone can create professional-quality images, music, and video without years of training.

Concerns: Deepfakes, copyright disputes (trained on artists’ work), job displacement in creative industries, misinformation through synthetic media.
Key Takeaways
1. GANs pioneered image generation but suffer from mode collapse and instability

2. Diffusion models denoise from random noise — now the dominant paradigm

3. Latent diffusion (Stable Diffusion) made generation fast and accessible

4. Text conditioning via CLIP/T5 + cross-attention enables text-to-image

5. Video extends diffusion to spacetime patches (Sora)

6. Audio uses codec tokens or spectrogram diffusion

7. The future is natively multimodal — any-to-any generation
Coming up: Ch 12 covers Reinforcement Learning in depth — from Q-learning to AlphaGo to RLHF, the paradigm that teaches agents to act in environments and aligns LLMs with human preferences.