Multimodal AI & Generative Media
From pixels to video — how AI sees, creates, and understands images, video, audio, and beyond.
Co-Created by Kiran Shirol and Claude
Topics
Diffusion Models
Vision-Language
CLIP & Embeddings
Text-to-Video
Generative Media
17 chapters · 5 sections
Section 1: Foundations — The Building Blocks
Why AI went multimodal, how machines see, generative model history, and contrastive learning.
1. Beyond Text: The Multimodal Revolution
Why AI went multimodal, key milestones from ImageNet to Sora, and the convergence of vision, audio, and text.
2. How Machines See
Pixels, CNNs, Vision Transformers, image patching — how a 1024×1024 image becomes 765 tokens.
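The 765-token figure comes from GPT-4V's published tile-based accounting. A minimal sketch of that arithmetic, assuming OpenAI's documented constants and omitting the initial resize step (which does not change the result at this size):

```python
import math

def gpt4v_image_tokens(width: int, height: int,
                       tile: int = 512, base: int = 85, per_tile: int = 170) -> int:
    """Tile-based token count in the style of GPT-4V's high-detail pricing:
    a fixed base cost plus a cost per 512x512 tile. Other VLMs differ."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles

# A 1024x1024 image covers 2x2 = 4 tiles: 85 + 4 * 170 = 765 tokens.
print(gpt4v_image_tokens(1024, 1024))  # -> 765
```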
3. The Generative Model Family Tree
VAEs, GANs, Normalizing Flows, Diffusion — how each generates, and why diffusion won.
4. Contrastive Learning & CLIP
How CLIP connects images and text, contrastive loss, LAION-5B, and zero-shot classification.
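The core of the chapter is CLIP's symmetric contrastive objective. A minimal PyTorch sketch; the temperature is fixed at 0.07 here for clarity, whereas CLIP learns it as a parameter:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss as in CLIP: the diagonal of the batch
    similarity matrix holds matched image-text pairs (positives);
    every other pairing in the batch is a negative."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (B, B) scaled cosine sims
    targets = torch.arange(logits.size(0))          # pair i matches pair i
    loss_i2t = F.cross_entropy(logits, targets)     # pick the right text per image
    loss_t2i = F.cross_entropy(logits.T, targets)   # pick the right image per text
    return (loss_i2t + loss_t2i) / 2

# Toy batch: 8 image-text pairs with 512-dim embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```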
Section 2: Generation — Creating Images, Video & Audio
Diffusion models, text-to-image, text-to-video, and speech & audio AI.
5. How Diffusion Models Work
Forward noise, reverse denoising, U-Net, latent diffusion, classifier-free guidance — the math made intuitive.
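The forward half has a convenient closed form: any noised step x_t can be sampled directly from the clean image x_0. A sketch assuming the linear beta schedule from the original DDPM paper:

```python
import torch

# Linear beta schedule from the original DDPM paper (1000 steps).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward_diffusion(x0, t):
    """Closed-form forward noising q(x_t | x_0):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps, eps

x0 = torch.randn(1, 3, 64, 64)        # stand-in for a (latent) training image
xt, eps = forward_diffusion(x0, t=500)

# Sampling runs the other way: a network predicts eps, and classifier-free
# guidance mixes two predictions at each step:
#   eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```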
6. Text-to-Image Generation
Stable Diffusion, DALL-E 3, Midjourney, Flux — ControlNet, inpainting, and the creative workflow.
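As a concrete starting point, a minimal generation call with Hugging Face's diffusers library; the checkpoint ID and prompt are illustrative, and a CUDA GPU is assumed:

```python
import torch
from diffusers import StableDiffusionPipeline  # pip install diffusers

# Any Stable Diffusion checkpoint on the Hub works; this ID is illustrative.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor fox in a snowy forest",
    num_inference_steps=30,   # denoising steps: quality/speed trade-off
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("fox.png")
```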
7. Text-to-Video & Motion
Sora’s DiT architecture, spacetime patches, temporal coherence, and current capabilities vs hard limits.
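The spacetime-patch idea fits in a few lines: cut a video tensor into small space-time blocks and flatten each block into one token. A NumPy sketch; the patch sizes are illustrative, since Sora's actual values are unpublished:

```python
import numpy as np

def spacetime_patches(video, pt=2, ph=16, pw=16):
    """Cut a video of shape (T, H, W, C) into pt x ph x pw spacetime
    patches and flatten each into one token, in the spirit of Sora's
    reported DiT input. Assumes T, H, W divide evenly by the patch sizes."""
    T, H, W, C = video.shape
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)      # group the patch dims together
    return v.reshape(-1, pt * ph * pw * C)    # (num_tokens, token_dim)

video = np.zeros((16, 256, 256, 3), dtype=np.float32)
tokens = spacetime_patches(video)             # (8 * 16 * 16, 1536) = (2048, 1536)
```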
8. Speech & Audio AI
Whisper, ElevenLabs, music generation, audio tokenization, and real-time voice agents.
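Transcription is the easiest entry point. A minimal sketch with OpenAI's open-source whisper package; the audio path is illustrative:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")        # sizes: tiny / base / small / medium / large
result = model.transcribe("meeting.mp3")  # audio file path is illustrative
print(result["text"])
```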
Section 3: Understanding — Perceiving & Connecting Modalities
Vision-language models, the model landscape, multimodal embeddings, and training.
9. How Vision-Language Models Work
The VLM architecture: vision encoder + projector + LLM. GPT-4V, Gemini, LLaVA, and visual QA.
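The projector is the smallest but most instructive piece of that stack. A sketch of a LLaVA-1.5-style two-layer MLP projector; the dimensions are illustrative (CLIP ViT-L patch features are 1024-dim, and a 7B LLM typically embeds tokens in 4096 dims):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """The 'glue' of a LLaVA-style VLM: map vision-encoder patch features
    into the LLM's embedding space so image patches can be consumed as
    ordinary tokens. LLaVA-1.5 uses a two-layer MLP like this one."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features):    # (B, num_patches, vision_dim)
        return self.proj(patch_features)  # (B, num_patches, llm_dim)

# 576 patch features, as from a 336x336 image with 14x14 patches (24 * 24 = 576).
visual_tokens = VisionProjector()(torch.randn(1, 576, 1024))
```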
10. The Multimodal Model Landscape
GPT-4o vs Gemini vs Claude vs open-source — native vs bolt-on, pricing, and choosing the right model.
11. Multimodal Embeddings & Search
Shared vector spaces, CLIP embeddings, cross-modal retrieval, and building image+text RAG.
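Cross-modal retrieval falls straight out of the shared space: embed the text query and candidate images with the same CLIP model, then rank by cosine similarity. A sketch using Hugging Face transformers; the image paths are illustrative:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Embed a text query and candidate images into the same vector space.
images = [Image.open(p) for p in ["a.jpg", "b.jpg"]]  # paths are illustrative
inputs = processor(text=["a dog on a beach"], images=images,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Cosine similarity = dot product of L2-normalized embeddings.
img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
scores = txt @ img.T              # (1, num_images) retrieval scores
best = scores.argmax().item()     # index of the best-matching image
```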
12. Training Multimodal Models
Pre-training on image-text pairs, LAION/DataComp, instruction tuning, RLHF, and compute requirements.
Section 4: Applications — Putting It All Together
Building multimodal apps, document understanding, and multimodal agents.
13. Building Multimodal Applications
Document understanding, visual QA, video summarization, image+text RAG, and practical architectures.
14. Multimodal Agents
Agents that see, hear, and act — computer use, voice interfaces, web agents, and real-world deployments.
Section 5: Responsibility & Future
Ethics, deepfakes, evaluation, and where multimodal AI is heading next.
15. Ethics, Deepfakes & Safety
Deepfake detection, C2PA content authenticity, bias in image generation, copyright, and watermarking.
16. Evaluating Multimodal Models
FID, CLIP score, human preference ELO, video quality metrics, VQA accuracy, and the evaluation gap.
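Of these, CLIP score is the easiest to compute yourself: a scaled cosine similarity between an image's CLIP embedding and its prompt's. A sketch using torchmetrics' implementation, with a random tensor standing in for a generated image:

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore  # pip install torchmetrics

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

# Random uint8 image standing in for a generated sample.
image = torch.randint(255, (3, 224, 224), dtype=torch.uint8)
score = metric(image, "a photo of an astronaut riding a horse")
print(score)  # higher = better prompt-image alignment (scaled toward 0-100)
```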
17. The Future of Multimodal AI
World models, real-time video generation, embodied AI, universal multimodal models, and the path forward.
Explore Related Courses
How LLMs Work · Transformers & Attention
Fine-Tuning · Adapting Models to Your Data
LLM Evaluation · Observability & Metrics
Agentic AI · Planning, Memory & Tool Use