Ch 1 — Beyond Text: The Multimodal Revolution

Why AI went multimodal, key milestones, and the convergence of vision, audio, and text
The Separate Worlds
AI modalities evolved independently for decades
The Old World
For decades, AI treated each modality as a separate discipline:

NLP: Text understanding, translation, generation
Computer Vision: Image classification, object detection
Speech: Recognition (STT), synthesis (TTS)
Audio: Music analysis, sound classification

Each had its own architectures, datasets, and research communities with almost no overlap.
Why They Were Separate
Different modalities seemed to require fundamentally different approaches. Text is sequential and discrete. Images are 2D grids of continuous pixel values. Audio is a 1D waveform, often sampled 44,100 times per second. The mathematical tools for each were distinct — RNNs for text, CNNs for images, spectral methods for audio.
The Unification Insight
The Transformer architecture (2017) changed everything. It could process any sequence of tokens — and researchers discovered that images, audio, and video could all be converted into token sequences. One architecture to rule them all. This insight — that all modalities can be tokenized — is the foundation of the multimodal revolution.
Key insight: The breakthrough wasn’t a new modality-specific technique. It was realizing that a general-purpose sequence model (the Transformer) could handle any modality once you tokenize it properly. This is why the same architecture powers GPT-4, Stable Diffusion, and Whisper.
The Milestone Timeline
From ImageNet to Sora in 12 years
Generation 1: Foundations (2012–2020)
2012 — AlexNet: Wins ImageNet by a landslide, proves deep learning works for vision
2014 — GANs: Goodfellow invents adversarial networks, first realistic image generation
2017 — Transformer: “Attention Is All You Need” paper, the architecture that changed everything
2018 — BERT/GPT: Pre-trained language models show transfer learning works at scale
2020 — ViT proposed: Dosovitskiy et al. show Transformers can replace CNNs for vision
Generation 2: Multimodal Explosion (2021–2025)
2021 — CLIP & DALL-E: OpenAI connects text and images in a shared embedding space
2022 — Stable Diffusion: Open-source text-to-image generation goes viral
2022 — Whisper: Universal multilingual speech recognition
2023 — GPT-4V: LLMs that can see and reason about images
2024 — Sora: Text-to-video generation stuns the world
2025 — Gemini 2.5: Native multimodal reasoning across text, image, audio, video
Key insight: The pace is accelerating. It took 9 years from AlexNet to CLIP, but only 3 years from CLIP to Sora. Each breakthrough enables the next one faster because foundational components (Transformers, CLIP embeddings, diffusion) are reused.
The Token Unification
Everything is a sequence of tokens
How Modalities Become Tokens
Text: Words → subword tokens via BPE (~50K vocabulary). “understanding” → [“under”, “standing”]
Images: Pixels → patches (16×16 or 14×14 pixel grids) → patch embeddings
Audio: Waveform → mel spectrogram → patches → tokens
Video: Frames → spacetime patches (spatial + temporal) → tokens

Once tokenized, the same Transformer architecture processes all modalities with self-attention.
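The image case above can be sketched concretely. This is a minimal, framework-free illustration of ViT-style patchification — splitting an image into non-overlapping patches and flattening each into a vector (in a real model, a learned linear projection then maps each flattened patch to an embedding):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an (N, patch*patch*C) array, where N = (H//patch) * (W//patch).
    Assumes H and W are divisible by the patch size.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)

img = np.zeros((224, 224, 3))    # a common ViT input resolution
tokens = patchify(img, 16)
print(tokens.shape)              # (196, 768): 196 patch "tokens"
```

A 224×224 image with 16×16 patches yields a sequence of 196 tokens — the same shape of input a Transformer sees for a 196-word text snippet.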
Why This Matters
Token unification means:

Shared architecture: One model handles text + images + audio
Cross-modal learning: Understanding in one modality transfers to others
Emergent capabilities: Models discover connections humans didn’t teach
Scaling: More data in any modality improves all modalities
Key insight: A 1024×1024 image costs ~765 tokens under GPT-4V's tiling scheme. A 30-second audio clip becomes ~1,500 tokens in Whisper's encoder. A 10-second video at 24 fps can run to ~36,000 tokens. These all fit naturally into the same context window as text — the only question is budget.
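The budget question can be made concrete with back-of-the-envelope arithmetic. Assuming a 128K-token context window (a typical size, not tied to any one model) and the approximate per-item costs quoted above:

```python
# Rough context-window budgeting for multimodal inputs. The per-item
# token costs are the approximate figures quoted in the text; the
# 128K context size is an assumption for illustration.
CONTEXT_WINDOW = 128_000

COSTS = {
    "1024x1024 image": 765,
    "30s audio clip": 1_500,
    "10s video @ 24fps": 36_000,
}

for item, tokens in COSTS.items():
    share = tokens / CONTEXT_WINDOW
    fit = CONTEXT_WINDOW // tokens
    print(f"{item}: {tokens:,} tokens ({share:.1%} of context, ~{fit} fit)")
```

Images and audio clips are cheap; video dominates the budget — only a handful of 10-second clips fill an entire 128K context.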
The Market Impact
$7.8B in 2025, projected $25B+ by 2028
Market Size
Multimodal AI market growth:

2023: $2.1B
2024: $4.2B (+100%)
2025: $7.8B (+86%)
2026: $12.5B (projected)
2028: $25B+ (projected)

Key segments: Image generation 35%, Document understanding 25%, Video generation 20%, Multimodal search 12%, Audio/speech 8%
Who’s Leading
OpenAI: GPT-4o (native multimodal), DALL-E 3, Sora
Google: Gemini 2.5 (native multimodal), Imagen 3, Veo 2
Anthropic: Claude 3.5 (vision), expanding multimodal
Meta: LLaMA 3 + LLaVA (open-source), Make-A-Video
Stability AI: Stable Diffusion, Stable Audio
Midjourney: Leading creative image generation
ElevenLabs: Voice synthesis and cloning
Key insight: Every major AI lab has gone multimodal-first. Text-only models are becoming a legacy category. The competitive moat is shifting from “best text model” to “best multimodal reasoning across all modalities.”
Why Multimodal Matters
Humans are multimodal — AI should be too
The Human Analogy
Humans process multiple modalities simultaneously: we see a face, hear a voice, read body language, and understand words — all at once. This multimodal integration is fundamental to how we understand the world. AI that only processes text is like a person who can only read — functional but profoundly limited.
Emergent Capabilities
Multimodal models develop capabilities that no single-modality model has:

Visual reasoning: “What’s wrong with this circuit diagram?”
Cross-modal search: Find images matching a text description
Document understanding: Extract data from invoices, charts, handwriting
Spatial reasoning: Understand 3D scenes from 2D images
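Cross-modal search from the list above reduces to nearest-neighbor lookup in a shared embedding space. A minimal sketch, using toy hand-made vectors in place of real CLIP-style text and image encoders:

```python
import numpy as np

def cosine_top_k(query: np.ndarray, gallery: np.ndarray, k: int = 2):
    """Return indices of the k gallery vectors most similar to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                     # cosine similarity per gallery item
    return np.argsort(-sims)[:k]     # highest similarity first

# Toy shared-space embeddings: in practice the query would come from a
# text encoder and the gallery from an image encoder trained jointly.
text_query = np.array([0.9, 0.1, 0.0])
image_embeddings = np.array([
    [0.8, 0.2, 0.1],   # image 0: close to the query
    [0.0, 0.1, 0.9],   # image 1: unrelated
    [0.7, 0.0, 0.2],   # image 2: somewhat close
])
print(cosine_top_k(text_query, image_embeddings, k=2))   # [0 2]
```

Because text and images live in the same space, "find images matching a description" is just this similarity ranking at scale.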
The Practical Impact
Multimodal AI unlocks applications that were impossible with text-only models:

• Medical imaging analysis + report generation
• Autonomous driving (vision + spatial reasoning)
• Video content moderation at scale
• Accessibility (image descriptions, audio transcription)
• Creative tools (text-to-image, text-to-video)
• Robotics (see, plan, act in physical world)
Key insight: The most valuable AI applications require understanding multiple modalities. Text-only AI is a stepping stone, not the destination. The $25B+ market is driven by use cases that fundamentally require seeing, hearing, or creating beyond text.
Three Waves of Multimodal AI
Generation, understanding, and integration
Wave 1: Generation (2021–2023)
AI learns to create images, audio, and video from text descriptions. DALL-E, Stable Diffusion, Midjourney, and ElevenLabs. The focus is on output quality, diversity, and creative control. Users write prompts; AI generates media.
Wave 2: Understanding (2023–2025)
AI learns to perceive and reason about images, video, and audio. GPT-4V, Gemini, and Claude with vision. The focus is on comprehension, analysis, and question-answering about visual content. Users upload images; AI explains them.
Wave 3: Integration (2025+)
AI seamlessly combines generation and understanding across modalities in a single model. Native multimodal models that can see an image, reason about it, generate a modified version, describe the changes, and explain the reasoning — all in one flow. GPT-4o and Gemini 2.5 are early examples.
Key insight: We’re in the transition from Wave 2 to Wave 3. The most exciting capabilities emerge when generation and understanding are unified. A model that can both see and create can iterate on designs, debug visual problems, and reason about the physical world.
The Modality Timeline
Each wave builds on the previous
Modality Adoption
2020 — Text (GPT-3)
2021 — + Images (CLIP, DALL-E)
2022 — + Audio (Whisper, Stable Diffusion)
2023 — + Vision reasoning (GPT-4V, Gemini)
2024 — + Video (Sora, Runway Gen-3)
2025 — + 3D / Spatial (emerging)
2026 — + Real-time native (unified models)
What’s Next
The frontier is moving toward:

Real-time video generation: Interactive, not batch — generate video as fast as you can watch it
3D and spatial: Understanding and generating 3D scenes from text or images
Embodied AI: Robots that see, hear, plan, and act in the physical world
Universal models: One model for all modalities, all tasks, all contexts
Key insight: Each new modality doesn’t replace the previous ones — it layers on top. The most capable systems will handle all modalities natively. The question isn’t “which modality?” but “how many modalities can we unify?”
What This Course Covers
Your roadmap for 17 chapters across 4 sections
Foundations (Ch 1–4)
How machines see (CNNs to Vision Transformers), the generative model family tree (VAEs, GANs, Diffusion), contrastive learning & CLIP — the backbone that connects text and images.
Generation (Ch 5–8)
How diffusion models work (forward/reverse process, U-Net, CFG), text-to-image (Stable Diffusion, DALL-E, Midjourney), text-to-video (Sora, DiT), and speech & audio AI (Whisper, ElevenLabs, music generation).
Understanding & Applications (Ch 9–14)
Vision-language models (GPT-4V, Gemini, LLaVA), the model landscape, multimodal embeddings & search, training multimodal models, building multimodal apps, and multimodal agents.
Responsibility & Future (Ch 15–17)
Ethics and deepfakes (C2PA, watermarking), evaluating multimodal models (FID, CLIP score), and the future — world models, embodied AI, and universal multimodal systems.
Key insight: This course is conceptual and practical, not tool-specific. The principles — tokenization, diffusion, contrastive learning, multimodal reasoning — apply across all platforms and models. Understanding the “why” lets you adapt as tools evolve.