Ch 14 — Multimodal Agents

AI that can see, hear, reason, and act — computer use, robotics, and autonomous systems
High level: Perceive → Reason → Act → Observe → Loop → Complete
What Are Multimodal Agents?
AI systems that perceive, reason, and act in the world
The Agent Loop
A multimodal agent extends the standard LLM agent loop with visual perception:

1. Perceive: Take a screenshot, capture camera feed, or receive an image
2. Reason: VLM analyzes the visual input + context and decides what to do
3. Act: Execute an action (click, type, API call, move robot arm)
4. Observe: Capture the result of the action (new screenshot, sensor data)
5. Loop: Repeat until the task is complete

The key difference from text agents: multimodal agents can see the results of their actions and adapt accordingly.
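The five steps above can be sketched as a minimal control loop. This is an illustrative skeleton, not any framework's API: `perceive`, `reason`, and `act` are placeholder callbacks standing in for the screenshot capture, the VLM call, and the automation layer.

```python
# Minimal sketch of the perceive-reason-act loop. The three component
# functions are placeholders for real VLM/automation calls.

def run_agent(task, perceive, reason, act, max_steps=10):
    """Run the loop until the model reports completion or steps run out."""
    observation = perceive()                    # 1. Perceive
    for step in range(max_steps):
        decision = reason(task, observation)    # 2. Reason
        if decision["done"]:                    # task judged complete
            return {"status": "complete", "steps": step}
        act(decision["action"])                 # 3. Act
        observation = perceive()                # 4. Observe
    return {"status": "step_limit", "steps": max_steps}   # 5. Loop capped
```

Note the explicit `max_steps` cap: a runaway loop is one of the most common agent failure modes, so production loops always bound the number of actions.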
Types of Multimodal Agents
// Categories of multimodal agents

Computer Use Agents
  See screen → click/type → see result
  Examples: Claude Computer Use, OpenAI Operator

Web Browsing Agents
  Navigate websites, fill forms, extract data
  Examples: WebVoyager, Agent-E

Robotic Agents
  Camera input → plan → motor control
  Examples: RT-2, PaLM-E, Octo

Real-Time Voice + Vision
  See + hear + respond in real time
  Examples: GPT-4o voice, Project Astra
Key insight: Multimodal agents represent the convergence of VLMs and agent frameworks. Vision gives agents the ability to interact with any visual interface — no APIs needed. This is the most transformative application of multimodal AI.
Computer Use Agents
AI that controls your computer by seeing the screen
How Computer Use Works
1. Screenshot capture: Take a screenshot of the current screen state
2. VLM analysis: Model identifies UI elements, reads text, understands layout
3. Action generation: Model outputs coordinates for clicks, text to type, or keyboard shortcuts
4. Execution: Automation framework performs the action
5. Verification: New screenshot confirms the action worked

The agent interacts with any application — no API integration needed. It uses the same visual interface as a human.
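The action-generation and execution steps typically revolve around a structured action schema: the VLM emits a JSON-like action, and a dispatcher routes it to an automation handler. The sketch below is a hedged illustration of that dispatch pattern; the handler bodies are stand-ins for real OS-level calls (e.g., via an automation library), and the action field names are assumptions, not any vendor's schema.

```python
# Sketch of a computer-use dispatch step: a structured VLM action in,
# a side effect (here, a descriptive string) out. Handler bodies are
# placeholders for real mouse/keyboard automation calls.

def execute_action(action, handlers):
    """Dispatch a structured action dict to the matching handler."""
    kind = action.get("type")
    if kind not in handlers:
        raise ValueError(f"unsupported action: {kind!r}")
    return handlers[kind](action)

handlers = {
    "click": lambda a: f"click at ({a['x']}, {a['y']})",
    "type":  lambda a: f"type {a['text']!r}",
    "key":   lambda a: f"press {a['combo']}",
}
```

Constraining the model to a small, typed action space like this (rather than free-form output) is what makes the verification step in 5 tractable.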
Current Capabilities
// What computer use agents can do (2025)
✓ Navigate websites and fill forms
✓ Use desktop applications (Excel, etc.)
✓ Transfer data between applications
✓ Follow multi-step workflows
✓ Handle basic error recovery

// Limitations
× Slow (2-5s per action)
× Fragile with complex UIs
× Can't handle CAPTCHAs reliably
× Struggles with drag-and-drop
× No fine motor control (pixel-precise)
Key insight: Computer use agents are the “universal API” — they can automate any software with a visual interface. But they’re currently 10–100x slower than API-based automation. Use them for tasks where no API exists, not as a replacement for existing integrations.
Web Browsing Agents
Autonomous navigation, data extraction, and web tasks
Architecture
Web agents combine VLMs with browser automation:

Browser engine: Playwright or Puppeteer controls a real browser
Visual input: Screenshots + accessibility tree (DOM structure)
Action space: Click, type, scroll, navigate, extract
Planning: VLM creates a step-by-step plan, then executes each step
Memory: Track visited pages, extracted data, and task progress
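The structural half of this architecture — targeting elements via the accessibility tree rather than pixel coordinates — can be sketched as a tree search. The nested-dict shape below is an assumption made for illustration (real trees come from the browser engine, e.g., a Playwright snapshot), but the role-plus-name lookup is the core idea.

```python
# Sketch of accessibility-tree element targeting: depth-first search
# for a node matching a given role and accessible name. The tree is a
# plain nested dict standing in for a real browser snapshot.

def find_node(tree, role, name):
    """Return the first node with matching role and name, else None."""
    if tree.get("role") == role and tree.get("name") == name:
        return tree
    for child in tree.get("children", []):
        hit = find_node(child, role, name)
        if hit is not None:
            return hit
    return None
```

Once a node is found, the agent can click it by stable identity ("the Submit button") instead of brittle screen coordinates, while the screenshot still supplies visual context for the VLM's planning.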
Use Cases
Research automation: “Find the top 10 competitors for [product] and create a comparison table”
Data collection: Extract structured data from websites that don’t have APIs
Form filling: Automate repetitive form submissions across multiple sites
Price monitoring: Track prices across e-commerce sites
Testing: Automated visual regression testing of web applications
Key insight: The combination of visual understanding (screenshots) and structural understanding (DOM/accessibility tree) makes web agents much more robust than either approach alone. The accessibility tree provides precise element targeting while screenshots provide visual context.
Robotics & Embodied AI
VLMs controlling physical robots
Vision-Language-Action Models
VLA models extend VLMs to output physical actions:

RT-2 (Google): Trained on robot demonstrations + web data. Can follow natural language instructions: “Pick up the blue cup and place it on the shelf.”
PaLM-E: 562B parameter model that combines PaLM with robot sensor data. Reasons about physical scenes.
Octo: Open-source generalist robot policy. Works across different robot hardware.
pi0 (Physical Intelligence): Foundation model for robot manipulation tasks.
How VLAs Work
// Vision-Language-Action pipeline
Input:
  Camera image (what the robot sees)
  + Language instruction ("pick up the cup")
  + Proprioception (joint positions)
Processing:
  VLM encodes image + text
  Action head outputs motor commands (joint angles, gripper open/close)
Output:
  Sequence of low-level motor actions
  Executed at 5-30 Hz control frequency

// Key challenge: real-time control
// VLMs are too slow for reactive control
// Solution: VLM plans, small model executes
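The "VLM plans, small model executes" split can be illustrated with a toy proportional controller: a slow planner (standing in for the VLM) sets a target pose occasionally, while the fast control loop nudges the joints toward it every tick. This is a deliberately simplified sketch of the timing split, not a real robot policy.

```python
# Toy sketch of the slow-planner / fast-controller split: the target
# pose changes rarely (VLM rate); the controller runs every tick
# (e.g. 30 Hz), moving each joint a fraction of the way to its target.

def control_step(joints, target, gain=0.2):
    """Proportional step: move each joint gain-fraction toward target."""
    return [j + gain * (t - j) for j, t in zip(joints, target)]

def run_episode(joints, target, ticks):
    for _ in range(ticks):          # fast loop at control frequency
        joints = control_step(joints, target)
    return joints
```

The exponential approach to the target is why the high-rate inner loop matters: reactive corrections happen many times between VLM plan updates.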
Key insight: The biggest breakthrough in robotic AI is using internet-scale visual knowledge (from VLM pre-training) for physical manipulation. A robot that has “seen” millions of images of cups knows what a cup looks like from any angle — even cups it’s never physically encountered.
Real-Time Voice + Vision
Agents that see and talk simultaneously
The Real-Time Multimodal Agent
The most advanced multimodal agents process vision, audio, and text simultaneously in real-time:

GPT-4o voice mode: Sees through your phone camera, hears your voice, responds with natural speech — all in <500ms
Project Astra (Google): Continuous video understanding + voice interaction. “Where did I leave my keys?” (remembers from earlier in the video stream)
Implications: AI assistants that can see your screen, hear your questions, and help in real-time
Architecture Challenges
Latency: Must respond in <500ms for natural conversation. Requires streaming inference and speculative decoding.
Context management: Continuous video generates thousands of tokens per minute. Need efficient token compression.
Turn-taking: Know when the user is speaking vs. pausing vs. finished. Requires voice activity detection.
Memory: Remember what was seen/said earlier in the conversation without exceeding context limits.
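The context-management challenge can be sketched as a rolling token budget: each video frame costs tokens, and when the budget is exceeded the oldest frames are evicted first. Real systems also compress frames into fewer tokens, but eviction under a budget is the core mechanism; the class below is an illustrative sketch, not any product's implementation.

```python
from collections import deque

# Sketch of a rolling context buffer for continuous video: frames are
# kept newest-first within a fixed token budget; oldest frames are
# evicted when adding a new one would exceed it.

class FrameBuffer:
    def __init__(self, token_budget):
        self.budget = token_budget
        self.frames = deque()       # (frame, token_cost) pairs, oldest left
        self.used = 0

    def add(self, frame, cost):
        self.frames.append((frame, cost))
        self.used += cost
        while self.used > self.budget:      # evict oldest until under budget
            _, old_cost = self.frames.popleft()
            self.used -= old_cost
```

A smarter variant would summarize evicted frames into a text memory ("keys last seen on the kitchen table") instead of discarding them outright, which is how "Where did I leave my keys?" stays answerable.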
Key insight: Real-time multimodal agents are the end-state of conversational AI. Instead of typing prompts, you show the AI what you’re looking at and talk naturally. This is the interface paradigm shift — from text-in/text-out to see-hear-speak.
Safety & Control
Keeping multimodal agents safe and controllable
Risk Categories
Unintended actions: Agent clicks the wrong button, sends an email, or deletes a file
Scope creep: Agent takes actions beyond what was requested
Prompt injection: Malicious content on a webpage tricks the agent into harmful actions
Data exfiltration: Agent accidentally sends sensitive data to external services
Irreversible actions: Agent makes purchases, deletes data, or sends messages that can’t be undone
Safety Patterns
Sandboxing: Run agents in isolated environments (VMs, containers)
Action allowlists: Only permit specific actions (no delete, no purchase, no send)
Human approval: Require confirmation for high-risk actions
Step limits: Cap the number of actions per task
Rollback capability: Undo recent actions if something goes wrong
Monitoring: Log every action with screenshots for audit
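Two of the patterns above — action allowlists and human approval — compose naturally into a single gate in front of the executor. The sketch below is illustrative: the action names and the `approve` callback (in production, a UI confirmation prompt) are assumptions, but the check order — approval list first, then allowlist — is the point.

```python
# Sketch of a safety gate combining an allowlist with human approval
# for high-risk actions. `approve` is a callback standing in for a
# real confirmation prompt; action names are illustrative.

ALLOWED = {"click", "type", "scroll"}
NEEDS_APPROVAL = {"send", "purchase", "delete"}

def gate(action, approve):
    """Return a decision string for a proposed agent action."""
    kind = action["type"]
    if kind in NEEDS_APPROVAL:                  # irreversible: ask a human
        if not approve(action):
            return "blocked: human denied"
        return "allowed: human approved"
    if kind not in ALLOWED:                     # unknown action: deny by default
        return "blocked: not on allowlist"
    return "allowed"
```

Deny-by-default matters here: an action the gate has never seen should be blocked, not passed through, because the VLM can emit unexpected action types.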
Key insight: Multimodal agents have the highest risk profile of any AI application because they can take real-world actions. A text hallucination is annoying; an agent hallucination that clicks “Send” on the wrong email is a disaster. Always sandbox, always require human approval for irreversible actions.
Building Multimodal Agents
Frameworks, tools, and practical patterns
Agent Frameworks
// Frameworks for multimodal agents

Computer Use
  Anthropic Computer Use API
  OpenAI Operator
  Open Interpreter (open source)

Web Agents
  Browser Use (Python, open source)
  Playwright + VLM (custom)
  Stagehand (TypeScript)

General Agent Frameworks
  LangGraph (stateful agent graphs)
  CrewAI (multi-agent orchestration)
  AutoGen (Microsoft, multi-agent)

Robotics
  ROS 2 + VLM integration
  LeRobot (Hugging Face)
Best Practices
Start with narrow scope: Automate one specific workflow, not “anything”
Use structured actions: Define a clear action space (click, type, scroll) not free-form
Add visual verification: After each action, verify the expected state
Implement retry logic: Actions fail — detect the failure and retry with a different approach
Log everything: Screenshots + actions + reasoning for debugging
Human escalation: When stuck or uncertain, ask the human
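The retry and verification practices combine into one small pattern: attempt an action, verify the resulting state, and retry up to a limit before escalating. The sketch is illustrative; `act` and `verify` are placeholders for the automation call and the screenshot-based state check.

```python
# Sketch of retry-with-verification: perform an action, verify the
# expected state, and retry up to a limit. `act` and `verify` stand in
# for the automation call and a screenshot/state check.

def act_with_retry(act, verify, attempts=3):
    """Return the attempt number that succeeded, or raise for escalation."""
    for attempt in range(1, attempts + 1):
        act()
        if verify():
            return attempt
    # Exhausted retries: surface the failure instead of silently continuing.
    raise RuntimeError(f"action failed after {attempts} attempts")
```

Raising after exhausting retries is the hook for the human-escalation practice above: the caller catches the error and hands the task to a person rather than pressing on in an unknown state.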
Pro tip: The most successful multimodal agents are narrow and reliable, not broad and fragile. An agent that perfectly automates one 10-step workflow is more valuable than one that attempts anything but fails 30% of the time.
Key Takeaways
The future of AI that can see and act
Essential Concepts
1. Perceive-Reason-Act loop: Multimodal agents see the world, reason about it, and take actions

2. Computer use: The “universal API” — automate any software through its visual interface

3. Web agents: Screenshots + DOM for robust web automation

4. VLA models: VLMs extended with action outputs for robotics

5. Safety first: Sandbox, allowlist, human approval for irreversible actions
Where We’re Headed
2025: Computer use agents for specific workflows, web research automation
2026: Real-time voice + vision assistants, reliable web agents
2027+: Embodied agents in homes and workplaces, autonomous physical tasks

Multimodal agents are the bridge between AI that knows things and AI that does things.
Next up: Chapter 15 tackles the critical topic of ethics, deepfakes, and safety in multimodal AI — the risks and responsibilities that come with AI that can generate and manipulate visual media.