Ch 14 — Multimodal Agents

AI that can see, hear, reason, and act — computer use, robotics, and autonomous systems
High level: Perceive → Reason → Act → Observe → Loop → Complete
What Are Multimodal Agents?
AI systems that perceive, reason, and act in the world
The Agent Loop
A multimodal agent extends the standard LLM agent loop with visual perception:

1. Perceive: Take a screenshot, capture camera feed, or receive an image
2. Reason: VLM analyzes the visual input + context and decides what to do
3. Act: Execute an action (click, type, API call, move robot arm)
4. Observe: Capture the result of the action (new screenshot, sensor data)
5. Loop: Repeat until the task is complete

The key difference from text agents: multimodal agents can see the results of their actions and adapt accordingly.
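The five steps above can be sketched as a minimal control loop. This is an illustrative skeleton, not any framework's API: `perceive`, `reason`, and `act` are placeholder callbacks standing in for the screenshot capture, the VLM call, and the automation layer.

```python
# Minimal sketch of the perceive-reason-act loop. The three component
# functions are placeholders for real VLM/automation calls.

def run_agent(task, perceive, reason, act, max_steps=10):
    """Run the loop until the model reports completion or steps run out."""
    observation = perceive()                    # 1. Perceive
    for step in range(max_steps):
        decision = reason(task, observation)    # 2. Reason
        if decision["done"]:                    # task judged complete
            return {"status": "complete", "steps": step}
        act(decision["action"])                 # 3. Act
        observation = perceive()                # 4. Observe
    return {"status": "step_limit", "steps": max_steps}   # 5. Loop capped
```

Note the explicit `max_steps` cap: a runaway loop is one of the most common agent failure modes, so production loops always bound the number of actions.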
Types of Multimodal Agents
// Categories of multimodal agents

Computer Use Agents
  See screen → click/type → see result
  Examples: Claude Computer Use, OpenAI Operator

Web Browsing Agents
  Navigate websites, fill forms, extract data
  Examples: WebVoyager, Agent-E

Robotic Agents
  Camera input → plan → motor control
  Examples: RT-2, PaLM-E, Octo

Real-Time Voice + Vision
  See + hear + respond in real time
  Examples: GPT-4o voice, Project Astra
Key insight: Multimodal agents represent the convergence of VLMs and agent frameworks. Vision gives agents the ability to interact with any visual interface — no APIs needed. This is the most transformative application of multimodal AI.
Computer Use Agents
AI that controls your computer by seeing the screen
How Computer Use Works
1. Screenshot capture: Take a screenshot of the current screen state
2. VLM analysis: Model identifies UI elements, reads text, understands layout
3. Action generation: Model outputs coordinates for clicks, text to type, or keyboard shortcuts
4. Execution: Automation framework performs the action
5. Verification: New screenshot confirms the action worked

The agent interacts with any application — no API integration needed. It uses the same visual interface as a human.
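The action-generation and execution steps typically revolve around a structured action schema: the VLM emits a JSON-like action, and a dispatcher routes it to an automation handler. The sketch below is a hedged illustration of that dispatch pattern; the handler bodies are stand-ins for real OS-level calls (e.g., via an automation library), and the action field names are assumptions, not any vendor's schema.

```python
# Sketch of a computer-use dispatch step: a structured VLM action in,
# a side effect (here, a descriptive string) out. Handler bodies are
# placeholders for real mouse/keyboard automation calls.

def execute_action(action, handlers):
    """Dispatch a structured action dict to the matching handler."""
    kind = action.get("type")
    if kind not in handlers:
        raise ValueError(f"unsupported action: {kind!r}")
    return handlers[kind](action)

handlers = {
    "click": lambda a: f"click at ({a['x']}, {a['y']})",
    "type":  lambda a: f"type {a['text']!r}",
    "key":   lambda a: f"press {a['combo']}",
}
```

Constraining the model to a small, typed action space like this (rather than free-form output) is what makes the verification step in 5 tractable.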
Current Capabilities
// What computer use agents can do (2025)
✓ Navigate websites and fill forms
✓ Use desktop applications (Excel, etc.)
✓ Transfer data between applications
✓ Follow multi-step workflows
✓ Handle basic error recovery

// Limitations
× Slow (2-5s per action)
× Fragile with complex UIs
× Can't handle CAPTCHAs reliably
× Struggles with drag-and-drop
× No fine motor control (pixel-precise)
Key insight: Computer use agents are the “universal API” — they can automate any software with a visual interface. But they’re currently 10–100x slower than API-based automation. Use them for tasks where no API exists, not as a replacement for existing integrations.
Web Browsing Agents
Autonomous navigation, data extraction, and web tasks
Architecture
Web agents combine VLMs with browser automation:

Browser engine: Playwright or Puppeteer controls a real browser
Visual input: Screenshots + accessibility tree (DOM structure)
Action space: Click, type, scroll, navigate, extract
Planning: VLM creates a step-by-step plan, then executes each step
Memory: Track visited pages, extracted data, and task progress
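The structural half of this architecture — targeting elements via the accessibility tree rather than pixel coordinates — can be sketched as a tree search. The nested-dict shape below is an assumption made for illustration (real trees come from the browser engine, e.g., a Playwright snapshot), but the role-plus-name lookup is the core idea.

```python
# Sketch of accessibility-tree element targeting: depth-first search
# for a node matching a given role and accessible name. The tree is a
# plain nested dict standing in for a real browser snapshot.

def find_node(tree, role, name):
    """Return the first node with matching role and name, else None."""
    if tree.get("role") == role and tree.get("name") == name:
        return tree
    for child in tree.get("children", []):
        hit = find_node(child, role, name)
        if hit is not None:
            return hit
    return None
```

Once a node is found, the agent can click it by stable identity ("the Submit button") instead of brittle screen coordinates, while the screenshot still supplies visual context for the VLM's planning.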
Use Cases
Research automation: “Find the top 10 competitors for [product] and create a comparison table”
Data collection: Extract structured data from websites that don’t have APIs
Form filling: Automate repetitive form submissions across multiple sites
Price monitoring: Track prices across e-commerce sites
Testing: Automated visual regression testing of web applications
Key insight: The combination of visual understanding (screenshots) and structural understanding (DOM/accessibility tree) makes web agents much more robust than either approach alone. The accessibility tree provides precise element targeting while screenshots provide visual context.
Robotics & Embodied AI
VLMs controlling physical robots
Vision-Language-Action Models
VLA models extend VLMs to output physical actions:

RT-2 (Google): Trained on robot demonstrations + web data. Can follow natural language instructions: “Pick up the blue cup and place it on the shelf.”
PaLM-E: 562B parameter model that combines PaLM with robot sensor data. Reasons about physical scenes.
Octo: Open-source generalist robot policy. Works across different robot hardware.
pi0 (Physical Intelligence): Foundation model for robot manipulation tasks.
How VLAs Work
// Vision-Language-Action pipeline
Input:
  Camera image (what the robot sees)
  + Language instruction ("pick up the cup")
  + Proprioception (joint positions)
Processing:
  VLM encodes image + text
  Action head outputs motor commands (joint angles, gripper open/close)
Output:
  Sequence of low-level motor actions
  Executed at 5-30 Hz control frequency

// Key challenge: real-time control
// VLMs are too slow for reactive control
// Solution: VLM plans, small model executes
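The "VLM plans, small model executes" split can be illustrated with a toy proportional controller: a slow planner (standing in for the VLM) sets a target pose occasionally, while the fast control loop nudges the joints toward it every tick. This is a deliberately simplified sketch of the timing split, not a real robot policy.

```python
# Toy sketch of the slow-planner / fast-controller split: the target
# pose changes rarely (VLM rate); the controller runs every tick
# (e.g. 30 Hz), moving each joint a fraction of the way to its target.

def control_step(joints, target, gain=0.2):
    """Proportional step: move each joint gain-fraction toward target."""
    return [j + gain * (t - j) for j, t in zip(joints, target)]

def run_episode(joints, target, ticks):
    for _ in range(ticks):          # fast loop at control frequency
        joints = control_step(joints, target)
    return joints
```

The exponential approach to the target is why the high-rate inner loop matters: reactive corrections happen many times between VLM plan updates.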
Key insight: The biggest breakthrough in robotic AI is using internet-scale visual knowledge (from VLM pre-training) for physical manipulation. A robot that has “seen” millions of images of cups knows what a cup looks like from any angle — even cups it’s never physically encountered.
Real-Time Voice + Vision
Agents that see and talk simultaneously
The Real-Time Multimodal Agent
The most advanced multimodal agents process vision, audio, and text simultaneously in real-time:

GPT-4o voice mode: Sees through your phone camera, hears your voice, responds with natural speech — all in <500ms
Project Astra (Google): Continuous video understanding + voice interaction. “Where did I leave my keys?” (remembers from earlier in the video stream)
Implications: AI assistants that can see your screen, hear your questions, and help in real-time
Architecture Challenges
Latency: Must respond in <500ms for natural conversation. Requires streaming inference and speculative decoding.
Context management: Continuous video generates thousands of tokens per minute. Need efficient token compression.
Turn-taking: Know when the user is speaking vs. pausing vs. finished. Requires voice activity detection.
Memory: Remember what was seen/said earlier in the conversation without exceeding context limits.
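The context-management challenge can be sketched as a rolling token budget: each video frame costs tokens, and when the budget is exceeded the oldest frames are evicted first. Real systems also compress frames into fewer tokens, but eviction under a budget is the core mechanism; the class below is an illustrative sketch, not any product's implementation.

```python
from collections import deque

# Sketch of a rolling context buffer for continuous video: frames are
# kept newest-first within a fixed token budget; oldest frames are
# evicted when adding a new one would exceed it.

class FrameBuffer:
    def __init__(self, token_budget):
        self.budget = token_budget
        self.frames = deque()       # (frame, token_cost) pairs, oldest left
        self.used = 0

    def add(self, frame, cost):
        self.frames.append((frame, cost))
        self.used += cost
        while self.used > self.budget:      # evict oldest until under budget
            _, old_cost = self.frames.popleft()
            self.used -= old_cost
```

A smarter variant would summarize evicted frames into a text memory ("keys last seen on the kitchen table") instead of discarding them outright, which is how "Where did I leave my keys?" stays answerable.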
Key insight: Real-time multimodal agents are the end-state of conversational AI. Instead of typing prompts, you show the AI what you’re looking at and talk naturally. This is the interface paradigm shift — from text-in/text-out to see-hear-speak.
Safety & Control
Keeping multimodal agents safe and controllable
Risk Categories
Unintended actions: Agent clicks the wrong button, sends an email, or deletes a file
Scope creep: Agent takes actions beyond what was requested
Prompt injection: Malicious content on a webpage tricks the agent into harmful actions
Data exfiltration: Agent accidentally sends sensitive data to external services
Irreversible actions: Agent makes purchases, deletes data, or sends messages that can’t be undone
Safety Patterns
Sandboxing: Run agents in isolated environments (VMs, containers)
Action allowlists: Only permit specific actions (no delete, no purchase, no send)
Human approval: Require confirmation for high-risk actions
Step limits: Cap the number of actions per task
Rollback capability: Undo recent actions if something goes wrong
Monitoring: Log every action with screenshots for audit
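Two of the patterns above — action allowlists and human approval — compose naturally into a single gate in front of the executor. The sketch below is illustrative: the action names and the `approve` callback (in production, a UI confirmation prompt) are assumptions, but the check order — approval list first, then allowlist — is the point.

```python
# Sketch of a safety gate combining an allowlist with human approval
# for high-risk actions. `approve` is a callback standing in for a
# real confirmation prompt; action names are illustrative.

ALLOWED = {"click", "type", "scroll"}
NEEDS_APPROVAL = {"send", "purchase", "delete"}

def gate(action, approve):
    """Return a decision string for a proposed agent action."""
    kind = action["type"]
    if kind in NEEDS_APPROVAL:                  # irreversible: ask a human
        if not approve(action):
            return "blocked: human denied"
        return "allowed: human approved"
    if kind not in ALLOWED:                     # unknown action: deny by default
        return "blocked: not on allowlist"
    return "allowed"
```

Deny-by-default matters here: an action the gate has never seen should be blocked, not passed through, because the VLM can emit unexpected action types.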
Key insight: Multimodal agents have the highest risk profile of any AI application because they can take real-world actions. A text hallucination is annoying; an agent hallucination that clicks “Send” on the wrong email is a disaster. Always sandbox, always require human approval for irreversible actions.
Building Multimodal Agents
Frameworks, tools, and practical patterns
Agent Frameworks
// Frameworks for multimodal agents

Computer Use
  Anthropic Computer Use API
  OpenAI Operator
  Open Interpreter (open source)

Web Agents
  Browser Use (Python, open source)
  Playwright + VLM (custom)
  Stagehand (TypeScript)

General Agent Frameworks
  LangGraph (stateful agent graphs)
  CrewAI (multi-agent orchestration)
  AutoGen (Microsoft, multi-agent)

Robotics
  ROS 2 + VLM integration
  LeRobot (Hugging Face)
Best Practices
Start with narrow scope: Automate one specific workflow, not “anything”
Use structured actions: Define a clear action space (click, type, scroll) not free-form
Add visual verification: After each action, verify the expected state
Implement retry logic: Actions fail — detect the failure and retry with a different approach
Log everything: Screenshots + actions + reasoning for debugging
Human escalation: When stuck or uncertain, ask the human
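The retry and verification practices combine into one small pattern: attempt an action, verify the resulting state, and retry up to a limit before escalating. The sketch is illustrative; `act` and `verify` are placeholders for the automation call and the screenshot-based state check.

```python
# Sketch of retry-with-verification: perform an action, verify the
# expected state, and retry up to a limit. `act` and `verify` stand in
# for the automation call and a screenshot/state check.

def act_with_retry(act, verify, attempts=3):
    """Return the attempt number that succeeded, or raise for escalation."""
    for attempt in range(1, attempts + 1):
        act()
        if verify():
            return attempt
    # Exhausted retries: surface the failure instead of silently continuing.
    raise RuntimeError(f"action failed after {attempts} attempts")
```

Raising after exhausting retries is the hook for the human-escalation practice above: the caller catches the error and hands the task to a person rather than pressing on in an unknown state.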
Pro tip: The most successful multimodal agents are narrow and reliable, not broad and fragile. An agent that perfectly automates one 10-step workflow is more valuable than one that attempts anything but fails 30% of the time.
Key Takeaways
The future of AI that can see and act
Essential Concepts
1. Perceive-Reason-Act loop: Multimodal agents see the world, reason about it, and take actions

2. Computer use: The “universal API” — automate any software through its visual interface

3. Web agents: Screenshots + DOM for robust web automation

4. VLA models: VLMs extended with action outputs for robotics

5. Safety first: Sandbox, allowlist, human approval for irreversible actions
Where We’re Headed
2025: Computer use agents for specific workflows, web research automation
2026: Real-time voice + vision assistants, reliable web agents
2027+: Embodied agents in homes and workplaces, autonomous physical tasks

Multimodal agents are the bridge between AI that knows things and AI that does things.
Next up: Chapter 15 tackles the critical topic of ethics, deepfakes, and safety in multimodal AI — the risks and responsibilities that come with AI that can generate and manipulate visual media.