Ch 1 — Beyond Text: The Multimodal Revolution

Why AI went multimodal, key milestones, and the convergence of vision, audio, and text
The Separate Worlds
AI modalities evolved independently for decades
The Old World
For decades, AI treated each modality as a separate discipline:

NLP: Text understanding, translation, generation
Computer Vision: Image classification, object detection
Speech: Recognition (STT), synthesis (TTS)
Audio: Music analysis, sound classification

Each had its own architectures, datasets, and research communities with almost no overlap.
Why They Were Separate
Different modalities seemed to require fundamentally different approaches. Text is sequential and discrete. Images are 2D grids of continuous pixel values. Audio is a 1D waveform, often sampled 44,100 times per second. The mathematical tools for each were distinct — RNNs for text, CNNs for images, spectral methods for audio.
The Unification Insight
The Transformer architecture (2017) changed everything. It could process any sequence of tokens — and researchers discovered that images, audio, and video could all be converted into token sequences. One architecture to rule them all. This insight — that all modalities can be tokenized — is the foundation of the multimodal revolution.
Key insight: The breakthrough wasn’t a new modality-specific technique. It was realizing that a general-purpose sequence model (the Transformer) could handle any modality once you tokenize it properly. This is why the same architecture powers GPT-4, Stable Diffusion, and Whisper.
The Milestone Timeline
From ImageNet to Sora in 12 years
Generation 1: Foundations (2012–2020)
2012 — AlexNet: Wins ImageNet by a landslide, proves deep learning works for vision
2014 — GANs: Goodfellow invents adversarial networks, first realistic image generation
2017 — Transformer: “Attention Is All You Need” paper, the architecture that changed everything
2018 — BERT/GPT: Pre-trained language models show transfer learning works at scale
2020 — ViT proposed: Dosovitskiy et al. show Transformers can replace CNNs for vision
Generation 2: Multimodal Explosion (2021–2025)
2021 — CLIP & DALL-E: OpenAI connects text and images in a shared embedding space
2022 — Stable Diffusion: Open-source text-to-image generation goes viral
2022 — Whisper: Universal multilingual speech recognition
2023 — GPT-4V: LLMs that can see and reason about images
2024 — Sora: Text-to-video generation stuns the world
2025 — Gemini 2.5: Native multimodal reasoning across text, image, audio, video
Key insight: The pace is accelerating. It took 9 years from AlexNet to CLIP, but only 3 years from CLIP to Sora. Each breakthrough enables the next one faster because foundational components (Transformers, CLIP embeddings, diffusion) are reused.
The Token Unification
Everything is a sequence of tokens
How Modalities Become Tokens
Text: Words → subword tokens via BPE (~50K vocabulary). “understanding” → [“under”, “standing”]
Images: Pixels → patches (16×16 or 14×14 pixel grids) → patch embeddings
Audio: Waveform → mel spectrogram → patches → tokens
Video: Frames → spacetime patches (spatial + temporal) → tokens

Once tokenized, the same Transformer architecture processes all modalities with self-attention.
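The image case above can be sketched concretely. This is a minimal, framework-free illustration of ViT-style patchification — splitting an image into non-overlapping patches and flattening each into a vector (in a real model, a learned linear projection then maps each flattened patch to an embedding):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an (N, patch*patch*C) array, where N = (H//patch) * (W//patch).
    Assumes H and W are divisible by the patch size.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))    # a common ViT input resolution
tokens = patchify(img, 16)
print(tokens.shape)              # (196, 768): 196 patch "tokens"
```

A 224×224 image with 16×16 patches yields a sequence of 196 tokens — the same shape of input a Transformer sees for a 196-word text snippet.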
Why This Matters
Token unification means:

Shared architecture: One model handles text + images + audio
Cross-modal learning: Understanding in one modality transfers to others
Emergent capabilities: Models discover connections humans didn’t teach
Scaling: More data in any modality improves all modalities
Key insight: A 1024×1024 image costs ~765 tokens under GPT-4V's tiling scheme. A 30-second audio clip becomes ~1,500 tokens in Whisper's encoder. A 10-second video at 24 fps can run to ~36,000 tokens. These all fit naturally into the same context window as text — the only question is budget.
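The budget question can be made concrete with back-of-the-envelope arithmetic. Assuming a 128K-token context window (a typical size, not tied to any one model) and the approximate per-item costs quoted above:

```python
# Rough context-window budgeting for multimodal inputs. The per-item
# token costs are the approximate figures quoted in the text; the
# 128K context size is an assumption for illustration.
CONTEXT_WINDOW = 128_000

COSTS = {
    "1024x1024 image": 765,
    "30s audio clip": 1_500,
    "10s video @ 24fps": 36_000,
}

for item, tokens in COSTS.items():
    share = tokens / CONTEXT_WINDOW
    fit = CONTEXT_WINDOW // tokens
    print(f"{item}: {tokens:,} tokens ({share:.1%} of context, ~{fit} fit)")
```

Images and audio clips are cheap; video dominates the budget — only a handful of 10-second clips fill an entire 128K context.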
The Market Impact
$7.8B in 2025, projected $25B+ by 2028
Market Size
Multimodal AI market growth:

2023: $2.1B
2024: $4.2B (+100%)
2025: $7.8B (+86%)
2026: $12.5B (projected)
2028: $25B+ (projected)

Key segments: Image generation 35%, Document understanding 25%, Video generation 20%, Multimodal search 12%, Audio/speech 8%
Who’s Leading
OpenAI: GPT-4o (native multimodal), DALL-E 3, Sora
Google: Gemini 2.5 (native multimodal), Imagen 3, Veo 2
Anthropic: Claude 3.5 (vision), expanding multimodal
Meta: LLaMA 3 + LLaVA (open-source), Make-A-Video
Stability AI: Stable Diffusion, Stable Audio
Midjourney: Leading creative image generation
ElevenLabs: Voice synthesis and cloning
Key insight: Every major AI lab has gone multimodal-first. Text-only models are becoming a legacy category. The competitive moat is shifting from “best text model” to “best multimodal reasoning across all modalities.”
Why Multimodal Matters
Humans are multimodal — AI should be too
The Human Analogy
Humans process multiple modalities simultaneously: we see a face, hear a voice, read body language, and understand words — all at once. This multimodal integration is fundamental to how we understand the world. AI that only processes text is like a person who can only read — functional but profoundly limited.
Emergent Capabilities
Multimodal models develop capabilities that no single-modality model has:

Visual reasoning: “What’s wrong with this circuit diagram?”
Cross-modal search: Find images matching a text description
Document understanding: Extract data from invoices, charts, handwriting
Spatial reasoning: Understand 3D scenes from 2D images
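Cross-modal search from the list above reduces to nearest-neighbor lookup in a shared embedding space. A minimal sketch, using toy hand-made vectors in place of real CLIP-style text and image encoders:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, gallery: np.ndarray, k: int = 2):
    """Return indices of the k gallery vectors most similar to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                     # cosine similarity per gallery item
    return np.argsort(-sims)[:k]     # highest similarity first

# Toy shared-space embeddings: in practice the query would come from a
# text encoder and the gallery from an image encoder trained jointly.
text_query = np.array([0.9, 0.1, 0.0])
image_embeddings = np.array([
    [0.8, 0.2, 0.1],   # image 0: close to the query
    [0.0, 0.1, 0.9],   # image 1: unrelated
    [0.7, 0.0, 0.2],   # image 2: somewhat close
])
print(cosine_top_k(text_query, image_embeddings, k=2))   # [0 2]
```

Because text and images live in the same space, "find images matching a description" is just this similarity ranking at scale.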
The Practical Impact
Multimodal AI unlocks applications that were impossible with text-only models:

• Medical imaging analysis + report generation
• Autonomous driving (vision + spatial reasoning)
• Video content moderation at scale
• Accessibility (image descriptions, audio transcription)
• Creative tools (text-to-image, text-to-video)
• Robotics (see, plan, act in physical world)
Key insight: The most valuable AI applications require understanding multiple modalities. Text-only AI is a stepping stone, not the destination. The $25B+ market is driven by use cases that fundamentally require seeing, hearing, or creating beyond text.
Three Waves of Multimodal AI
Generation, understanding, and integration
Wave 1: Generation (2021–2023)
AI learns to create images, audio, and video from text descriptions. DALL-E, Stable Diffusion, Midjourney, and ElevenLabs. The focus is on output quality, diversity, and creative control. Users write prompts; AI generates media.
Wave 2: Understanding (2023–2025)
AI learns to perceive and reason about images, video, and audio. GPT-4V, Gemini, and Claude with vision. The focus is on comprehension, analysis, and question-answering about visual content. Users upload images; AI explains them.
Wave 3: Integration (2025+)
AI seamlessly combines generation and understanding across modalities in a single model. Native multimodal models that can see an image, reason about it, generate a modified version, describe the changes, and explain the reasoning — all in one flow. GPT-4o and Gemini 2.5 are early examples.
Key insight: We’re in the transition from Wave 2 to Wave 3. The most exciting capabilities emerge when generation and understanding are unified. A model that can both see and create can iterate on designs, debug visual problems, and reason about the physical world.
The Modality Timeline
Each wave builds on the previous
Modality Adoption
2020 — Text (GPT-3)
2021 — + Images (CLIP, DALL-E)
2022 — + Audio (Whisper, Stable Diffusion)
2023 — + Vision reasoning (GPT-4V, Gemini)
2024 — + Video (Sora, Runway Gen-3)
2025 — + 3D / Spatial (emerging)
2026 — + Real-time native (unified models)
What’s Next
The frontier is moving toward:

Real-time video generation: Interactive, not batch — generate video as fast as you can watch it
3D and spatial: Understanding and generating 3D scenes from text or images
Embodied AI: Robots that see, hear, plan, and act in the physical world
Universal models: One model for all modalities, all tasks, all contexts
Key insight: Each new modality doesn’t replace the previous ones — it layers on top. The most capable systems will handle all modalities natively. The question isn’t “which modality?” but “how many modalities can we unify?”
What This Course Covers
Your roadmap for 17 chapters across 4 sections
Foundations (Ch 1–4)
How machines see (CNNs to Vision Transformers), the generative model family tree (VAEs, GANs, Diffusion), contrastive learning & CLIP — the backbone that connects text and images.
Generation (Ch 5–8)
How diffusion models work (forward/reverse process, U-Net, CFG), text-to-image (Stable Diffusion, DALL-E, Midjourney), text-to-video (Sora, DiT), and speech & audio AI (Whisper, ElevenLabs, music generation).
Understanding & Applications (Ch 9–14)
Vision-language models (GPT-4V, Gemini, LLaVA), the model landscape, multimodal embeddings & search, training multimodal models, building multimodal apps, and multimodal agents.
Responsibility & Future (Ch 15–17)
Ethics and deepfakes (C2PA, watermarking), evaluating multimodal models (FID, CLIP score), and the future — world models, embodied AI, and universal multimodal systems.
Key insight: This course is conceptual and practical, not tool-specific. The principles — tokenization, diffusion, contrastive learning, multimodal reasoning — apply across all platforms and models. Understanding the “why” lets you adapt as tools evolve.