Ch 17 — The Future of Multimodal AI

World models, embodied AI, universal interfaces, and what’s coming next
Where We Are Today
The state of multimodal AI in 2025–2026
Current Capabilities
Vision-language models: GPT-4o, Gemini 2.5, Claude 3.5 understand images at near-human level for many tasks
Image generation: Photorealistic images from text in seconds (DALL-E 3, Midjourney, Stable Diffusion)
Video generation: Short clips (5–60s) with improving temporal coherence (Sora, Runway, Kling)
Voice AI: Real-time voice with emotion and personality (GPT-4o voice mode)
Multimodal agents: Early computer use and web browsing agents (Claude Computer Use, Operator)
Current Limitations
Hallucination: Models still confidently describe things that aren’t there
Video length: Generating coherent video beyond 60 seconds remains difficult
3D understanding: Models understand 2D images but struggle with true 3D reasoning
Physical reasoning: Understanding physics, causality, and object permanence
Real-time: Most models are too slow for real-time video processing
Cost: Multimodal inference is 10–100x more expensive than text-only
Key insight: We’re at an inflection point. Multimodal AI has crossed the “good enough” threshold for many applications but hasn’t yet reached the reliability needed for high-stakes autonomous use. The next 2–3 years will close this gap.
World Models
AI that understands how the physical world works
What Are World Models?
World models are AI systems that build an internal simulation of the physical world:

Physics understanding: Objects fall, liquids flow, collisions have consequences
Object permanence: Things exist even when not visible
Causality: Pushing a cup causes it to move; dropping it causes it to fall
Prediction: Given the current state, predict what happens next

Video generation models (Sora) are early world models — they must simulate physics to generate coherent video.
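To make the prediction idea concrete, here is a minimal Python sketch of the interface a learned world model exposes: given the current state and a candidate action, predict the next state, and roll predictions forward to evaluate a plan before acting. The class names, the toy dynamics function, and the rollout helper are illustrative assumptions, not any specific library's API.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class State:
    # Latent representation of the scene (here just a list of floats).
    features: List[float]

class WorldModel:
    """Predicts the next latent state given the current state and an action."""

    def __init__(self, dynamics: Callable[[List[float], List[float]], List[float]]):
        # `dynamics` stands in for a learned neural transition function.
        self.dynamics = dynamics

    def predict(self, state: State, action: List[float]) -> State:
        return State(self.dynamics(state.features, action))

    def rollout(self, state: State, plan: List[List[float]]) -> List[State]:
        # Mentally simulate a sequence of actions before executing any of them.
        trajectory = []
        for action in plan:
            state = self.predict(state, action)
            trajectory.append(state)
        return trajectory

# Toy dynamics: each action nudges the state; a real model would be learned from video.
toy_dynamics = lambda feats, act: [f + a for f, a in zip(feats, act)]

model = WorldModel(toy_dynamics)
start = State([0.0, 0.0])
plan = [[1.0, 0.0], [0.0, 1.0]]  # "push right, then push up"
print([s.features for s in model.rollout(start, plan)])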
Why World Models Matter
World models unlock:
Robotics: Plan actions by simulating outcomes ("If I push this, what happens?")
Autonomous vehicles: Predict other drivers' behavior; simulate rare scenarios for training
Scientific simulation: Protein folding, weather, materials; faster than physics-based simulation
Game/film production: Generate entire 3D worlds from text; real-time interactive environments
Planning & reasoning: "What would happen if...?"; mental simulation for decision-making
Key insight: World models are the bridge between perception and action. Current VLMs can see the world; world models will let AI understand how the world works. This is the key missing piece for truly capable autonomous systems.
Embodied AI & Robotics
AI that exists in and interacts with the physical world
The Convergence
Three technologies are converging to create embodied AI:

1. Foundation models: VLMs provide visual understanding and language reasoning
2. World models: Physical simulation enables planning and prediction
3. Robot hardware: Humanoid robots (Figure, Tesla Optimus, 1X) provide the physical platform

The result: robots that can understand natural language instructions, see their environment, reason about physics, and execute complex manipulation tasks.
Timeline
2025–2026: Robots in controlled environments (warehouses, factories). Simple pick-and-place, navigation.
2027–2028: Robots in semi-structured environments (restaurants, hospitals). Multi-step tasks with error recovery.
2029–2030: Robots in homes. General-purpose household assistance. The “iPhone moment” for robotics.
2030+: Ubiquitous embodied AI. Robots as common as smartphones.
Key insight: The bottleneck for embodied AI is no longer perception or reasoning — it’s reliable physical manipulation. A robot can understand “fold the laundry” but can’t yet reliably fold a shirt. Solving dexterous manipulation is the next frontier.
The Universal Interface
How multimodal AI changes human-computer interaction
Interface Evolution
The evolution of human-computer interaction:
1970s: Command line (type commands)
1984: GUI (point and click)
2007: Touch (tap and swipe)
2011: Voice (Siri, Alexa)
2023: Chat (ChatGPT)
2025: Multimodal (see + hear + speak)
2027 (projected): Ambient (always-on, contextual)
Each shift is more natural and lower friction; multimodal means communicating the way you would with another human.
What This Looks Like
Point your phone at a broken appliance: AI diagnoses the problem and walks you through the fix (see the sketch after this list)
Show your fridge contents: AI suggests recipes and generates a shopping list
Wear smart glasses: Real-time translation of signs, menus, and conversations
Describe a room: AI generates a 3D interior design you can walk through in VR
Sketch on a napkin: AI turns your rough sketch into a polished design or working prototype
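As referenced in the first item above, the "point your phone at it" interaction reduces to a single multimodal request that pairs an image with a plain-language question. A minimal sketch, with call_vlm and diagnose as hypothetical placeholders for whichever vision-language API you actually use:

import base64
from pathlib import Path

def call_vlm(image_b64: str, question: str) -> str:
    # Placeholder: in practice this is a request to your vision-language model
    # provider; the name `call_vlm` is an assumption, not a real API.
    return "The door seal is worn; replace the gasket."

def diagnose(photo_path: str, question: str) -> str:
    # Encode the photo and send it together with the user's question in one request.
    image_b64 = base64.b64encode(Path(photo_path).read_bytes()).decode()
    return call_vlm(image_b64, question)

# Usage: snap a photo of the appliance and ask in plain language.
# print(diagnose("dishwasher.jpg", "Why is this dishwasher leaking?"))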
Key insight: The ultimate interface is no interface. Multimodal AI enables interaction through natural human modalities — pointing, speaking, showing, gesturing. The keyboard and mouse become optional, not required.
Scientific Discovery
Multimodal AI accelerating research and discovery
Current Impact
Drug discovery: AlphaFold predicted 200M+ protein structures. Next: multimodal models that understand protein function from structure + sequence + literature
Materials science: Generate and evaluate new materials by understanding crystal structures, properties, and synthesis conditions
Climate science: Analyze satellite imagery + weather data + climate models for better predictions
Medical imaging: AI radiologists that combine image analysis with patient history and medical literature
The AI Scientist
The future AI scientist will be multimodal (a rough orchestration sketch follows this list):

Read: Process millions of papers, patents, and datasets
See: Analyze microscopy images, satellite data, experimental results
Reason: Form hypotheses by connecting observations across modalities
Simulate: Use world models to predict experimental outcomes
Design: Generate new experiments, molecules, materials
Communicate: Explain findings in natural language with visualizations
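A rough sketch of how those six steps could be wired together as a loop. Every function below (read_literature, analyze_images, form_hypotheses, simulate, design_experiment) is a hypothetical placeholder for a multimodal model call, included only to show the shape of the pipeline:

from typing import List

def read_literature(query: str) -> List[str]:
    return [f"finding related to {query}"]            # papers, patents, datasets

def analyze_images(paths: List[str]) -> List[str]:
    return [f"pattern in {p}" for p in paths]         # microscopy, satellite, plots

def form_hypotheses(text: List[str], visual: List[str]) -> List[str]:
    return [f"hypothesis linking {t} and {v}" for t, v in zip(text, visual)]

def simulate(hypothesis: str) -> float:
    return 0.7                                        # world-model predicted outcome score

def design_experiment(hypothesis: str) -> str:
    return f"protocol to test: {hypothesis}"

def research_loop(query: str, image_paths: List[str]) -> List[str]:
    # Read and see, reason across modalities, simulate, then design what to test next.
    findings = read_literature(query)
    observations = analyze_images(image_paths)
    hypotheses = form_hypotheses(findings, observations)
    promising = [h for h in hypotheses if simulate(h) > 0.5]
    return [design_experiment(h) for h in promising]

print(research_loop("membrane protein stability", ["cryo_em_grid.png"]))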
Key insight: Scientific breakthroughs increasingly happen at the intersection of modalities — connecting a pattern in microscopy images with a trend in genomic data with a prediction from physics simulation. Multimodal AI is uniquely suited to find these cross-modal connections.
Creative Industries
How multimodal AI transforms media, entertainment, and art
Near-Term (2025–2027)
Film pre-visualization: Generate storyboards and pre-visualization sequences from scripts
Game asset generation: Create textures, characters, environments from descriptions
Music production: Generate backing tracks, sound effects, and arrangements
Advertising: Personalized video ads generated for each viewer
Education: Interactive visual explanations generated on-demand
Longer-Term (2027+)
Interactive storytelling: AI-generated movies that respond to viewer choices in real-time
Virtual worlds: Generate entire explorable 3D environments from text descriptions
Digital humans: Photorealistic AI characters for customer service, education, entertainment
Personalized content: Every piece of media customized to individual preferences
New art forms: Creative mediums that don’t exist yet, enabled by multimodal generation
Key insight: Multimodal AI won’t replace human creativity — it will amplify it. A single person with AI tools will be able to produce content that previously required a team of 50. The bottleneck shifts from execution to vision and taste.
Preparing for What’s Coming
How to stay ahead of the multimodal revolution
For Engineers
Build multimodal-first: Design systems that handle text, images, audio, and video from the start
Master eval: The ability to measure multimodal quality is the most valuable skill
Learn fine-tuning: Domain-specific multimodal models will be the competitive moat
Understand embeddings: Shared embedding spaces are the foundation of multimodal search and retrieval (see the sketch after this list)
Stay flexible: Abstract model calls — the best model changes every 6 months
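A minimal sketch of why shared embedding spaces matter: once images, audio, and text map into one vector space, cross-modal retrieval becomes a cosine-similarity lookup. The embed stub below stands in for a CLIP-style encoder; the function names are assumptions for illustration, not a real API:

import math
from typing import Dict, List

def embed(item: str) -> List[float]:
    # Stub for a CLIP-style encoder that maps text OR images into one shared space.
    # A real encoder returns a learned vector; here we hash characters for illustration.
    vec = [0.0] * 8
    for i, ch in enumerate(item.lower()):
        vec[i % 8] += ord(ch) / 100.0
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query: str, index: Dict[str, List[float]]) -> List[str]:
    # Rank indexed items (image captions, frames, audio transcripts) by similarity.
    q = embed(query)
    return sorted(index, key=lambda k: cosine(q, index[k]), reverse=True)

# Index "images" by their embeddings, then query with plain text.
index = {name: embed(name) for name in ["cat_on_sofa.jpg", "invoice_scan.png", "beach_sunset.jpg"]}
print(search("a photo of a cat", index))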
For Organizations
Audit visual data: What images, videos, and documents does your organization have? This is your multimodal AI fuel.
Start with high-ROI use cases: Document processing, visual inspection, content generation
Build the data flywheel: Log, evaluate, improve — the cycle that compounds
Invest in safety: Responsible AI practices are a competitive advantage, not a cost
Hire cross-disciplinary: The best multimodal teams combine ML, vision, audio, and product skills
Key insight: The organizations that win in the multimodal era will be those that treat their visual and audio data as first-class assets, build robust eval pipelines, and invest in domain-specific fine-tuning. Generic API calls are table stakes — differentiation comes from your data and your eval.
Course Summary
What we’ve learned across 17 chapters
The Big Picture
1. Foundations: Vision Transformers, contrastive learning (CLIP), diffusion models — the building blocks

2. Generation: Text-to-image (Stable Diffusion, DALL-E), text-to-video (Sora), speech & audio (Whisper, ElevenLabs)

3. Understanding: VLMs (GPT-4V, Gemini), multimodal embeddings, cross-modal search

4. Application: Training, building apps, multimodal agents, multimodal RAG

5. Responsibility: Ethics, deepfakes, safety, evaluation, and the future
The One Thing to Remember
Multimodal AI is not just “LLMs + images.” It’s a fundamental shift in how AI systems perceive, understand, and interact with the world.

The models that will define the next decade are not text-only — they see, hear, speak, and act. The applications that will create the most value are those that leverage multiple modalities together, not in isolation.

The future belongs to those who build multimodal-first.
Congratulations! You’ve completed the Multimodal AI & Generative Media course. You now have a comprehensive understanding of how multimodal AI works, from foundational concepts to production deployment to the future of the field.