Ch 17 — The Future of Multimodal AI

World models, embodied AI, universal interfaces, and what’s coming next
Where We Are Today
The state of multimodal AI in 2025–2026
Current Capabilities
Vision-language models: GPT-4o, Gemini 2.5, Claude 3.5 understand images at near-human level for many tasks
Image generation: Photorealistic images from text in seconds (DALL-E 3, Midjourney, Stable Diffusion)
Video generation: Short clips (5–60s) with improving temporal coherence (Sora, Runway, Kling)
Voice AI: Real-time voice with emotion and personality (GPT-4o voice mode)
Multimodal agents: Early computer use and web browsing agents (Claude Computer Use, Operator)
Current Limitations
Hallucination: Models still confidently describe things that aren’t there
Video length: Generating coherent video beyond 60 seconds remains difficult
3D understanding: Models understand 2D images but struggle with true 3D reasoning
Physical reasoning: Understanding physics, causality, and object permanence
Real-time: Most models are too slow for real-time video processing
Cost: Multimodal inference is 10–100x more expensive than text-only
Key insight: We’re at an inflection point. Multimodal AI has crossed the “good enough” threshold for many applications but hasn’t yet reached the reliability needed for high-stakes autonomous use. The next 2–3 years will close this gap.
World Models
AI that understands how the physical world works
What Are World Models?
World models are AI systems that build an internal simulation of the physical world:

Physics understanding: Objects fall, liquids flow, collisions have consequences
Object permanence: Things exist even when not visible
Causality: Pushing a cup causes it to move; dropping it causes it to fall
Prediction: Given the current state, predict what happens next

Video generation models (Sora) are early world models — they must simulate physics to generate coherent video.
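To make the prediction idea concrete, here is a minimal Python sketch of the interface a learned world model exposes: given the current state and a candidate action, predict the next state, and roll predictions forward to evaluate a plan before acting. The class names, the toy dynamics function, and the rollout helper are illustrative assumptions, not any specific library's API.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class State:
    # Latent representation of the scene (here just a list of floats).
    features: List[float]

class WorldModel:
    """Predicts the next latent state given the current state and an action."""

    def __init__(self, dynamics: Callable[[List[float], List[float]], List[float]]):
        # `dynamics` stands in for a learned neural transition function.
        self.dynamics = dynamics

    def predict(self, state: State, action: List[float]) -> State:
        return State(self.dynamics(state.features, action))

    def rollout(self, state: State, plan: List[List[float]]) -> List[State]:
        # Mentally simulate a sequence of actions before executing any of them.
        trajectory = []
        for action in plan:
            state = self.predict(state, action)
            trajectory.append(state)
        return trajectory

# Toy dynamics: each action nudges the state; a real model would be learned from video.
toy_dynamics = lambda feats, act: [f + a for f, a in zip(feats, act)]

model = WorldModel(toy_dynamics)
start = State([0.0, 0.0])
plan = [[1.0, 0.0], [0.0, 1.0]]  # "push right, then push up"
print([s.features for s in model.rollout(start, plan)])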
Why World Models Matter
World models unlock:
Robotics: Plan actions by simulating outcomes ("If I push this, what happens?")
Autonomous vehicles: Predict other drivers' behavior; simulate rare scenarios for training
Scientific simulation: Protein folding, weather, materials; faster than physics-based simulation
Game/film production: Generate entire 3D worlds from text; real-time interactive environments
Planning & reasoning: "What would happen if...?"; mental simulation for decision-making
Key insight: World models are the bridge between perception and action. Current VLMs can see the world; world models will let AI understand how the world works. This is the key missing piece for truly capable autonomous systems.
Embodied AI & Robotics
AI that exists in and interacts with the physical world
The Convergence
Three technologies are converging to create embodied AI:

1. Foundation models: VLMs provide visual understanding and language reasoning
2. World models: Physical simulation enables planning and prediction
3. Robot hardware: Humanoid robots (Figure, Tesla Optimus, 1X) provide the physical platform

The result: robots that can understand natural language instructions, see their environment, reason about physics, and execute complex manipulation tasks.
Timeline
2025–2026: Robots in controlled environments (warehouses, factories). Simple pick-and-place, navigation.
2027–2028: Robots in semi-structured environments (restaurants, hospitals). Multi-step tasks with error recovery.
2029–2030: Robots in homes. General-purpose household assistance. The “iPhone moment” for robotics.
2030+: Ubiquitous embodied AI. Robots as common as smartphones.
Key insight: The bottleneck for embodied AI is no longer perception or reasoning — it’s reliable physical manipulation. A robot can understand “fold the laundry” but can’t yet reliably fold a shirt. Solving dexterous manipulation is the next frontier.
The Universal Interface
How multimodal AI changes human-computer interaction
Interface Evolution
The evolution of human-computer interaction:
1970s: Command line (type commands)
1984: GUI (point and click)
2007: Touch (tap and swipe)
2011: Voice (Siri, Alexa)
2023: Chat (ChatGPT)
2025: Multimodal (see + hear + speak)
2027 (projected): Ambient (always-on, contextual)
Each shift is more natural and lower friction; multimodal means communicating the way you would with another human.
What This Looks Like
Point your phone at a broken appliance: AI diagnoses the problem and walks you through the fix (see the sketch after this list)
Show your fridge contents: AI suggests recipes and generates a shopping list
Wear smart glasses: Real-time translation of signs, menus, and conversations
Describe a room: AI generates a 3D interior design you can walk through in VR
Sketch on a napkin: AI turns your rough sketch into a polished design or working prototype
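As referenced in the first item above, the "point your phone at it" interaction reduces to a single multimodal request that pairs an image with a plain-language question. A minimal sketch, with call_vlm and diagnose as hypothetical placeholders for whichever vision-language API you actually use:

import base64
from pathlib import Path

def call_vlm(image_b64: str, question: str) -> str:
    # Placeholder: in practice this is a request to your vision-language model
    # provider; the name `call_vlm` is an assumption, not a real API.
    return "The door seal is worn; replace the gasket."

def diagnose(photo_path: str, question: str) -> str:
    # Encode the photo and send it together with the user's question in one request.
    image_b64 = base64.b64encode(Path(photo_path).read_bytes()).decode()
    return call_vlm(image_b64, question)

# Usage: snap a photo of the appliance and ask in plain language.
# print(diagnose("dishwasher.jpg", "Why is this dishwasher leaking?"))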
Key insight: The ultimate interface is no interface. Multimodal AI enables interaction through natural human modalities — pointing, speaking, showing, gesturing. The keyboard and mouse become optional, not required.
Scientific Discovery
Multimodal AI accelerating research and discovery
Current Impact
Drug discovery: AlphaFold predicted 200M+ protein structures. Next: multimodal models that understand protein function from structure + sequence + literature
Materials science: Generate and evaluate new materials by understanding crystal structures, properties, and synthesis conditions
Climate science: Analyze satellite imagery + weather data + climate models for better predictions
Medical imaging: AI radiologists that combine image analysis with patient history and medical literature
The AI Scientist
The future AI scientist will be multimodal (a rough orchestration sketch follows this list):

Read: Process millions of papers, patents, and datasets
See: Analyze microscopy images, satellite data, experimental results
Reason: Form hypotheses by connecting observations across modalities
Simulate: Use world models to predict experimental outcomes
Design: Generate new experiments, molecules, materials
Communicate: Explain findings in natural language with visualizations
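A rough sketch of how those six steps could be wired together as a loop. Every function below (read_literature, analyze_images, form_hypotheses, simulate, design_experiment) is a hypothetical placeholder for a multimodal model call, included only to show the shape of the pipeline:

from typing import List

def read_literature(query: str) -> List[str]:
    return [f"finding related to {query}"]            # papers, patents, datasets

def analyze_images(paths: List[str]) -> List[str]:
    return [f"pattern in {p}" for p in paths]         # microscopy, satellite, plots

def form_hypotheses(text: List[str], visual: List[str]) -> List[str]:
    return [f"hypothesis linking {t} and {v}" for t, v in zip(text, visual)]

def simulate(hypothesis: str) -> float:
    return 0.7                                        # world-model predicted outcome score

def design_experiment(hypothesis: str) -> str:
    return f"protocol to test: {hypothesis}"

def research_loop(query: str, image_paths: List[str]) -> List[str]:
    # Read and see, reason across modalities, simulate, then design what to test next.
    findings = read_literature(query)
    observations = analyze_images(image_paths)
    hypotheses = form_hypotheses(findings, observations)
    promising = [h for h in hypotheses if simulate(h) > 0.5]
    return [design_experiment(h) for h in promising]

print(research_loop("membrane protein stability", ["cryo_em_grid.png"]))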
Key insight: Scientific breakthroughs increasingly happen at the intersection of modalities — connecting a pattern in microscopy images with a trend in genomic data with a prediction from physics simulation. Multimodal AI is uniquely suited to find these cross-modal connections.
Creative Industries
How multimodal AI transforms media, entertainment, and art
Near-Term (2025–2027)
Film pre-visualization: Generate storyboards and pre-visualization sequences from scripts
Game asset generation: Create textures, characters, environments from descriptions
Music production: Generate backing tracks, sound effects, and arrangements
Advertising: Personalized video ads generated for each viewer
Education: Interactive visual explanations generated on-demand
Longer-Term (2027+)
Interactive storytelling: AI-generated movies that respond to viewer choices in real-time
Virtual worlds: Generate entire explorable 3D environments from text descriptions
Digital humans: Photorealistic AI characters for customer service, education, entertainment
Personalized content: Every piece of media customized to individual preferences
New art forms: Creative mediums that don’t exist yet, enabled by multimodal generation
Key insight: Multimodal AI won’t replace human creativity — it will amplify it. A single person with AI tools will be able to produce content that previously required a team of 50. The bottleneck shifts from execution to vision and taste.
Preparing for What’s Coming
How to stay ahead of the multimodal revolution
For Engineers
Build multimodal-first: Design systems that handle text, images, audio, and video from the start
Master eval: The ability to measure multimodal quality is the most valuable skill
Learn fine-tuning: Domain-specific multimodal models will be the competitive moat
Understand embeddings: Shared embedding spaces are the foundation of multimodal search and retrieval (see the sketch after this list)
Stay flexible: Abstract model calls — the best model changes every 6 months
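A minimal sketch of why shared embedding spaces matter: once images, audio, and text map into one vector space, cross-modal retrieval becomes a cosine-similarity lookup. The embed stub below stands in for a CLIP-style encoder; the function names are assumptions for illustration, not a real API:

import math
from typing import Dict, List

def embed(item: str) -> List[float]:
    # Stub for a CLIP-style encoder that maps text OR images into one shared space.
    # A real encoder returns a learned vector; here we hash characters for illustration.
    vec = [0.0] * 8
    for i, ch in enumerate(item.lower()):
        vec[i % 8] += ord(ch) / 100.0
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query: str, index: Dict[str, List[float]]) -> List[str]:
    # Rank indexed items (image captions, frames, audio transcripts) by similarity.
    q = embed(query)
    return sorted(index, key=lambda k: cosine(q, index[k]), reverse=True)

# Index "images" by their embeddings, then query with plain text.
index = {name: embed(name) for name in ["cat_on_sofa.jpg", "invoice_scan.png", "beach_sunset.jpg"]}
print(search("a photo of a cat", index))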
For Organizations
Audit visual data: What images, videos, and documents does your organization have? This is your multimodal AI fuel.
Start with high-ROI use cases: Document processing, visual inspection, content generation
Build the data flywheel: Log, evaluate, improve — the cycle that compounds
Invest in safety: Responsible AI practices are a competitive advantage, not a cost
Hire cross-disciplinary: The best multimodal teams combine ML, vision, audio, and product skills
Key insight: The organizations that win in the multimodal era will be those that treat their visual and audio data as first-class assets, build robust eval pipelines, and invest in domain-specific fine-tuning. Generic API calls are table stakes — differentiation comes from your data and your eval.
Course Summary
What we’ve learned across 17 chapters
The Big Picture
1. Foundations: Vision Transformers, contrastive learning (CLIP), diffusion models — the building blocks

2. Generation: Text-to-image (Stable Diffusion, DALL-E), text-to-video (Sora), speech & audio (Whisper, ElevenLabs)

3. Understanding: VLMs (GPT-4V, Gemini), multimodal embeddings, cross-modal search

4. Application: Training, building apps, multimodal agents, multimodal RAG

5. Responsibility: Ethics, deepfakes, safety, evaluation, and the future
The One Thing to Remember
Multimodal AI is not just “LLMs + images.” It’s a fundamental shift in how AI systems perceive, understand, and interact with the world.

The models that will define the next decade are not text-only — they see, hear, speak, and act. The applications that will create the most value are those that leverage multiple modalities together, not in isolation.

The future belongs to those who build multimodal-first.
Congratulations! You’ve completed the Multimodal AI & Generative Media course. You now have a comprehensive understanding of how multimodal AI works, from foundational concepts to production deployment to the future of the field.