Ch 9 — Generative Adversarial Networks

Generator vs. discriminator, the min-max game, DCGAN, StyleGAN, and training instability
High Level
[Interactive diagram: Noise z → Generator → Fake image → Discriminator → Real/Fake verdict → Feedback to the generator]
The Adversarial Idea
Goodfellow et al. (2014) — two networks in competition
The Counterfeiter and the Detective
In June 2014, Ian Goodfellow introduced Generative Adversarial Networks (GANs) with a brilliant analogy: a generator (counterfeiter) tries to create fake data that looks real, while a discriminator (detective) tries to distinguish real data from fakes. They train simultaneously in a minimax game. As the discriminator gets better at spotting fakes, the generator must produce more convincing outputs. At equilibrium, the generator produces data indistinguishable from real data, and the discriminator outputs 0.5 (can’t tell the difference).
The Min-Max Objective
// GAN objective (Goodfellow, 2014)
min_G max_D V(D, G) = E[log D(x)]           // real → 1
                    + E[log(1 - D(G(z)))]   // fake → 0

// D wants to maximize V: correctly classify real vs. fake
// G wants to minimize V: fool D
// z ~ N(0, 1): random noise input
// G(z): generated fake sample
// D(x): probability that x is real
Key insight: GANs require no explicit density estimation or reconstruction loss. The generator never sees real data directly — it only receives gradient signals from the discriminator. This adversarial training produces remarkably sharp outputs.
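The value function above can be evaluated numerically. The following is a toy NumPy sketch (not a training loop): a fixed logistic discriminator and a fixed affine generator on 1-D data, with all parameter values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Discriminator: probability that x is real (sigmoid of a score)
    return 1.0 / (1.0 + np.exp(-(2.0 * (x - 1.5))))

def G(z):
    # Generator: maps noise z ~ N(0, 1) to samples (badly, around 0)
    return 0.5 * z

x_real = rng.normal(3.0, 0.5, size=10_000)   # real data ~ N(3, 0.5)
z = rng.normal(0.0, 1.0, size=10_000)        # noise input

# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
print(V)
```

Because this discriminator separates the two distributions easily, V stays near 0 (both expectations near log 1); at the ideal equilibrium, where D outputs 0.5 everywhere, V would drop to −2·log 2 ≈ −1.386.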
DCGAN: Making GANs Work with Images
Radford et al. (2016) — architectural guidelines for stable training
The Architecture That Worked
Early GANs used fully connected layers and produced blurry, low-resolution images. DCGAN (Deep Convolutional GAN, Radford et al., 2016) established architectural guidelines that made GANs work reliably for images: use strided convolutions (not pooling) in the discriminator, transposed convolutions in the generator, batch normalization in both (except the output layer), ReLU in the generator, and LeakyReLU in the discriminator. These guidelines became the foundation for all subsequent GAN architectures.
DCGAN Generator
// DCGAN Generator (noise → image)
z (100-dim noise)
→ ConvTranspose(512, 4×4)   // 1→4
→ BN → ReLU
→ ConvTranspose(256, 4×4)   // 4→8
→ BN → ReLU
→ ConvTranspose(128, 4×4)   // 8→16
→ BN → ReLU
→ ConvTranspose(64, 4×4)    // 16→32
→ BN → ReLU
→ ConvTranspose(3, 4×4)     // 32→64
→ Tanh                      // output in [-1, 1]
Key insight: DCGAN showed that the latent space z has meaningful structure. Walking in latent space produces smooth transitions: “man with glasses” - “man” + “woman” = “woman with glasses.” This arithmetic in latent space was a stunning demonstration of learned representations.
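The generator's 1→4→8→16→32→64 progression can be sanity-checked with the transposed-convolution output-size formula. A small sketch, assuming the standard DCGAN hyperparameters (kernel 4, stride 2, padding 1 for the upsampling layers; stride 1, padding 0 for the first):

```python
# Output size of a transposed convolution:
# out = (in - 1) * stride - 2 * padding + kernel
def conv_transpose_out(size, kernel=4, stride=2, padding=1):
    return (size - 1) * stride - 2 * padding + kernel

size = conv_transpose_out(1, stride=1, padding=0)  # first layer: 1 -> 4
sizes = [size]
for _ in range(4):                                  # four stride-2 upsamplings
    size = conv_transpose_out(size)                 # doubles each time
    sizes.append(size)

print(sizes)  # [4, 8, 16, 32, 64]
```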
Training Instability & Mode Collapse
Why GANs are notoriously hard to train
The Balancing Act
GAN training is a delicate balance. If the discriminator is too strong, it provides no useful gradient signal to the generator (vanishing gradients). If the generator is too strong, the discriminator can’t learn. Mode collapse is the most common failure: the generator finds a few outputs that fool the discriminator and keeps producing only those, ignoring the diversity of real data. The training dynamics are a non-stationary game where each update changes the optimization landscape for the other player.
Critical in AI: Unlike standard loss minimization, GAN training has no single loss to monitor. The generator and discriminator losses oscillate, and low generator loss doesn’t guarantee good images. Visual inspection and metrics like FID are essential.
Common Failure Modes
// Mode collapse
//   Generator produces only 1-2 types of faces
//   regardless of input noise z

// Vanishing gradients
//   Discriminator too confident → D(G(z)) ≈ 0 (sigmoid saturated)
//   ∇ log(1 - D(G(z))) ≈ 0 → no learning signal for G

// Training oscillation
//   G and D take turns "winning"
//   Never reach equilibrium

// Solutions:
// ✓ Wasserstein loss (WGAN, 2017)
// ✓ Spectral normalization (Miyato et al., 2018)
// ✓ Progressive growing (Karras et al., 2018)
// ✓ Two time-scale update rule (TTUR)
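The vanishing-gradient failure is easy to see numerically. Writing D = sigmoid(s), where s is the discriminator's logit on a fake sample, the saturating generator loss log(1 − D) has gradient −sigmoid(s) with respect to s, while the non-saturating alternative −log D (suggested in the original GAN paper) has gradient −(1 − sigmoid(s)). A quick sketch comparing the two when D is confident:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s = -6.0                      # confident discriminator: D(G(z)) ≈ 0.0025

# Saturating loss  L = log(1 - D):   dL/ds = -sigmoid(s)       → ≈ 0
grad_saturating = -sigmoid(s)

# Non-saturating  L = -log(D):       dL/ds = -(1 - sigmoid(s)) → ≈ -1
grad_nonsaturating = -(1.0 - sigmoid(s))

print(grad_saturating, grad_nonsaturating)
```

The saturating gradient is effectively zero exactly when the generator most needs a learning signal; the non-saturating loss keeps it near full strength.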
Wasserstein GAN (WGAN)
A better loss function for stable training
The Earth Mover’s Distance
Arjovsky et al. (2017) identified that the original GAN loss (Jensen-Shannon divergence) can be flat when the generator and real distributions don’t overlap, providing no gradient. WGAN replaces this with the Wasserstein distance (Earth Mover’s Distance) — intuitively, the minimum “work” needed to transform one distribution into another. This loss is smooth everywhere, provides meaningful gradients even when distributions are far apart, and correlates with image quality. WGAN-GP (Gulrajani et al., 2017) added a gradient penalty for even more stable training.
WGAN Loss
// WGAN objective
Critic (not "discriminator"):
  max E[C(x)] - E[C(G(z))]
Generator:
  max E[C(G(z))]
// C outputs an unbounded score (not a probability)
// No sigmoid on critic output

// WGAN-GP: add gradient penalty
L_critic += λ · E[(||∇C(x̂)||₂ - 1)²]
// x̂ = random interpolation between real and fake
// λ = 10 (standard)
Key insight: WGAN’s critic loss directly correlates with image quality — as the loss decreases, images improve. This was the first time GAN training had a meaningful metric to monitor, making hyperparameter tuning far more practical.
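The gradient-penalty term can be sketched in NumPy for a linear critic C(x) = w·x, whose input gradient is w everywhere, so the penalty has a closed form we can check by hand. Shapes and values here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([3.0, -4.0])               # critic weights; ||w||₂ = 5

x_real = rng.normal(0.0, 1.0, (8, 2))
x_fake = rng.normal(5.0, 1.0, (8, 2))

eps = rng.uniform(0.0, 1.0, (8, 1))     # per-sample interpolation coefficient
x_hat = eps * x_real + (1 - eps) * x_fake

# ∇_x C(x̂) = w for every x̂ (linear critic), so ||∇C(x̂)||₂ = 5
grad_norms = np.full(len(x_hat), np.linalg.norm(w))

lam = 10.0                               # standard λ
gp = lam * np.mean((grad_norms - 1.0) ** 2)
print(gp)  # 10 · (5 - 1)² = 160.0
```

In a real WGAN-GP implementation the gradient ∇C(x̂) comes from autodiff, not a closed form; the penalty pushes the critic toward gradient norm 1 at the interpolated points, enforcing the 1-Lipschitz constraint the Wasserstein formulation requires.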
StyleGAN: Photorealistic Face Generation
Karras et al. (2019) — style-based generator architecture
The Style-Based Generator
NVIDIA’s StyleGAN (Karras et al., 2019) produced the first truly photorealistic AI-generated faces at 1024×1024 resolution. Its key innovation: instead of feeding noise z directly to the generator, it maps z through a mapping network to a style vector w, which is then injected into each layer via Adaptive Instance Normalization (AdaIN). Different layers control different levels of detail: early layers control pose and face shape, middle layers control features like eyes and nose, and late layers control fine details like skin texture and hair.
StyleGAN Evolution
// StyleGAN family
StyleGAN (2019): 1024×1024 faces
  - Mapping network z → w
  - AdaIN style injection per layer
  - Progressive growing
StyleGAN2 (2020): fixed artifacts
  - Weight demodulation (replaces AdaIN)
  - Path length regularization
  - No progressive growing needed
StyleGAN3 (2021): alias-free
  - Continuous equivariance
  - Smooth video generation

// FID scores (lower = better):
// StyleGAN: 4.40, StyleGAN2: 2.84
Key insight: StyleGAN’s style mixing capability — taking coarse styles from one face and fine styles from another — proved that the generator learned a hierarchical, disentangled representation of faces without any explicit supervision.
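The AdaIN operation at the heart of StyleGAN's style injection is small enough to sketch directly: normalize each feature channel to zero mean and unit variance, then scale and shift it with parameters derived from the style vector w. A minimal version, with illustrative shapes:

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-5):
    # x: (channels, height, width) feature map for one sample
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sigma + eps)          # per-channel normalization
    return style_scale[:, None, None] * x_norm + style_bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(7.0, 3.0, (2, 4, 4))            # arbitrary input statistics
scale = np.array([2.0, 0.5])                   # derived from style vector w
bias = np.array([1.0, -1.0])

out = adain(x, scale, bias)
# After AdaIN, each channel's statistics match the style, not the input:
print(out[0].mean(), out[0].std())  # ≈ 1.0, ≈ 2.0
```

This is why the style controls the layer: whatever statistics the incoming features had, they leave the layer carrying the style's mean and variance.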
Conditional GANs & Applications
Controlling what GANs generate
Conditional Generation
A conditional GAN (cGAN) provides additional information (class label, text description, input image) to both generator and discriminator, controlling what is generated. Pix2Pix (Isola et al., 2017) translates between image domains (sketches → photos, day → night). CycleGAN (Zhu et al., 2017) does unpaired image translation (horses → zebras) using cycle consistency loss. GauGAN (Park et al., 2019) turns semantic segmentation maps into photorealistic landscapes.
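The simplest conditioning mechanism is to concatenate a one-hot class label to the generator's noise input (and likewise to the discriminator's input). A sketch, where the 100-dim noise and 10 classes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes, z_dim = 10, 100
z = rng.normal(0.0, 1.0, z_dim)

label = 3                                  # condition: "generate class 3"
one_hot = np.zeros(num_classes)
one_hot[label] = 1.0

g_input = np.concatenate([z, one_hot])     # generator sees noise + label
print(g_input.shape)  # (110,)
```

For richer conditions (a text embedding, or a full input image as in Pix2Pix), the label vector is replaced by that embedding or the image itself, but the principle is the same: both players see the condition, so the discriminator can penalize samples that are realistic but wrong for the condition.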
GAN Applications
// Major GAN applications
Image synthesis:            StyleGAN faces, bedrooms, cars
Image-to-image translation: Pix2Pix, CycleGAN, GauGAN
Super-resolution:           SRGAN, ESRGAN (upscale images)
Data augmentation:          generate training data for rare classes
Video generation:           StyleGAN-V, MoCoGAN
3D generation:              EG3D (3D-aware face generation)
Rule of thumb: GANs excel at producing sharp, high-quality single images. For text-conditioned generation (text-to-image), diffusion models (DALL-E, Stable Diffusion) have largely replaced GANs due to better diversity and controllability.
Evaluating GANs
FID, IS, and why evaluation is hard
The Evaluation Challenge
How do you measure if generated images are “good”? There’s no single loss to minimize. The Fréchet Inception Distance (FID) compares the distribution of generated images to real images using features from a pretrained Inception network. Lower FID means the generated distribution is closer to real. The Inception Score (IS) measures both quality (confident classifications) and diversity (spread across classes). FID is the standard metric, but it requires thousands of samples and doesn’t capture all aspects of quality.
Metrics
// FID (Fréchet Inception Distance)
FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2·(Σ_r·Σ_g)^½)
// μ_r, Σ_r: mean/covariance of real image features
// μ_g, Σ_g: mean/covariance of generated image features
// Lower = better (0 = identical distributions)

// Typical FID scores (FFHQ faces):
// DCGAN:     ~30
// StyleGAN:  ~4.4
// StyleGAN2: ~2.8
// Diffusion: ~2.0
Key insight: FID measures distribution similarity, not individual image quality. A GAN can produce a few stunning images but have high FID due to poor diversity (mode collapse). Always evaluate both quality and diversity.
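The FID formula can be sketched for the special case of diagonal covariances, where the matrix square root is elementwise and plain NumPy suffices (real FID uses full covariances of 2048-dim Inception features and a proper matrix square root, e.g. `scipy.linalg.sqrtm`):

```python
import numpy as np

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    # ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2(Σ_r·Σ_g)^½), with Σ diagonal
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2.0 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

mu_r, var_r = np.array([0.0, 0.0]), np.array([1.0, 1.0])

# Identical distributions → FID = 0
print(fid_diagonal(mu_r, var_r, mu_r, var_r))          # 0.0

# Shift the mean by 2 in each dimension → FID = 2² + 2² = 8
print(fid_diagonal(mu_r, var_r, mu_r + 2.0, var_r))    # 8.0
```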
GANs’ Legacy & What’s Next
From adversarial training to diffusion models
The GAN Era (2014–2021)
GANs dominated image generation from 2014 to ~2021, producing increasingly photorealistic results. They proved that neural networks could create, not just classify. But their training instability, mode collapse, and difficulty with text conditioning led to their gradual replacement by diffusion models for most generation tasks. GANs remain relevant for real-time applications (super-resolution, video enhancement) where diffusion’s slow sampling is prohibitive.
The connection: GANs taught us that adversarial training produces sharp outputs, and that latent spaces can encode meaningful structure. These ideas influenced diffusion models, contrastive learning, and even LLM alignment (RLHF uses a reward model reminiscent of a discriminator). Next: Regularization & Practical Training.
GAN Timeline
// GAN milestones
2014: Original GAN (Goodfellow)
2016: DCGAN (convolutional GANs)
2017: WGAN (Wasserstein distance)
2017: Pix2Pix, CycleGAN
2018: Progressive GAN (1024×1024)
2019: StyleGAN (photorealistic faces)
2019: BigGAN (class-conditional, ImageNet)
2020: StyleGAN2 (artifact-free)
2021: Diffusion models begin to dominate
2022: DALL-E 2, Stable Diffusion replace GANs