Foundations (Ch 1–4)
How machines see (CNNs to Vision Transformers), the generative model family tree (VAEs, GANs, Diffusion), and contrastive learning & CLIP, the backbone that connects text and images.
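To make the contrastive-learning idea concrete, here is a minimal sketch of the symmetric contrastive (InfoNCE-style) loss that CLIP trains with. It assumes PyTorch and a batch of paired image/text embeddings; the function name and temperature default are illustrative, not taken from any specific library.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs."""
    # L2-normalize so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # logits[i, j] = similarity between image i and text j
    logits = image_emb @ text_emb.t() / temperature
    # Matched pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image->text and text->image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Pulling matched pairs together and pushing mismatched ones apart is what yields the shared text-image space the rest of the course builds on.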
Generation (Ch 5–8)
How diffusion models work (forward/reverse process, U-Net, classifier-free guidance), text-to-image (Stable Diffusion, DALL-E, Midjourney), text-to-video (Sora, DiT), and speech & audio AI (Whisper, ElevenLabs, music generation).
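Two of the ideas named above fit in a few lines. Below is a sketch, assuming PyTorch, of the closed-form forward (noising) step from the DDPM formulation, q(x_t | x_0), and of how classifier-free guidance combines conditional and unconditional noise predictions at sampling time. The schedule values mirror the original DDPM paper; the function names and the guidance scale of 7.5 (a commonly used value) are illustrative.

```python
import torch

def forward_diffusion(x0, t, alphas_cumprod):
    """q(x_t | x_0): noise a clean sample directly to step t in closed form."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return xt, noise  # the model is trained to predict `noise` from (xt, t)

def cfg_noise_estimate(eps_cond, eps_uncond, guidance_scale=7.5):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by `guidance_scale`."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Example schedule: linear betas as in the original DDPM paper
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
```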
Understanding & Applications (Ch 9–14)
Vision-language models (GPT-4V, Gemini, LLaVA), the model landscape, multimodal embeddings & search, training multimodal models, building multimodal apps, and multimodal agents.
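Multimodal search largely reduces to nearest-neighbor lookup in a shared embedding space. A minimal NumPy sketch follows; `embed_text` and `embed_image` are hypothetical placeholders for any CLIP-style encoder pair.

```python
import numpy as np

def search(query_emb, item_embs, top_k=5):
    """Rank items by cosine similarity to a query in a shared embedding space."""
    q = query_emb / np.linalg.norm(query_emb)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = items @ q                  # cosine similarity per item
    best = np.argsort(-scores)[:top_k]  # indices of the top-k matches
    return best, scores[best]

# Usage sketch (embed_image / embed_text stand in for real encoders):
#   index = np.stack([embed_image(img) for img in images])
#   hits, scores = search(embed_text("a dog on a beach"), index)
```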
Responsibility & Future (Ch 15–17)
Ethics and deepfakes (C2PA, watermarking), evaluating multimodal models (FID, CLIP score), and the future: world models, embodied AI, and universal multimodal systems.
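As a taste of the evaluation chapter, CLIP score is simple enough to sketch here: a rescaled cosine similarity between an image embedding and a caption embedding from a CLIP-style encoder. The embeddings are assumed to come from any shared text-image encoder, and the rescaling factor `w` is a convention that varies across implementations.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=100.0):
    """Scaled cosine similarity between an image and a caption embedding."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    # Clipped at zero so the score is non-negative
    return w * max(float(i @ t), 0.0)
```

Higher is better; unlike FID, it needs no set of reference images, only the prompt and the generated image.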
Key insight: This course is conceptual and practical, not tool-specific. The principles (tokenization, diffusion, contrastive learning, multimodal reasoning) apply across platforms and models, so understanding the “why” lets you adapt as tools evolve.