Risk Categories
• Unintended actions: Agent clicks the wrong button, sends an email it was not asked to send, or deletes a file by mistake
• Scope creep: Agent takes actions beyond what was requested
• Prompt injection: Malicious content on a webpage tricks the agent into harmful actions
• Data exfiltration: Agent accidentally sends sensitive data to external services
• Irreversible actions: Agent makes purchases, deletes data, or sends messages that can’t be undone
Safety Patterns
• Sandboxing: Run agents in isolated environments (VMs, containers)
• Action allowlists: Permit only an explicit set of actions; anything not listed (for example delete, purchase, send) is refused
• Human approval: Require confirmation for high-risk actions
• Step limits: Cap the number of actions per task
• Rollback capability: Undo recent actions if something goes wrong
• Monitoring: Log every action with screenshots for audit
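Several of the patterns above (allowlist, human approval, step limit, audit log) can be combined into a single gate that every proposed action passes through before it runs. The following is a minimal sketch, not a reference implementation: the action names, the `ActionGate` class, and the `approve` callback are all hypothetical and do not come from any particular agent framework. Screenshot capture and sandboxing are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical action names; a real agent framework defines its own.
ALLOWED_ACTIONS = {"click", "type", "scroll", "screenshot", "send_email"}
HIGH_RISK_ACTIONS = {"send_email"}  # on the allowlist, but needs human approval


class ActionBlocked(Exception):
    """Raised when the gate refuses an action."""


@dataclass
class ActionGate:
    approve: Callable[[str, dict], bool]  # assumed human-approval callback
    max_steps: int = 20
    steps_taken: int = 0
    audit_log: list = field(default_factory=list)

    def check(self, action: str, params: dict) -> None:
        """Log the decision, then raise ActionBlocked if the action may not run."""
        reason = self._denial_reason(action, params)
        self.audit_log.append((action, params, reason or "allowed"))
        if reason:
            raise ActionBlocked(reason)
        self.steps_taken += 1  # only permitted actions count toward the limit

    def _denial_reason(self, action: str, params: dict) -> Optional[str]:
        if self.steps_taken >= self.max_steps:
            return "step limit reached"
        if action not in ALLOWED_ACTIONS:
            return "action not on allowlist"
        if action in HIGH_RISK_ACTIONS and not self.approve(action, params):
            return "human approval denied"
        return None
```

In this sketch the agent loop would call `gate.check(action, params)` before executing each action and stop the task when `ActionBlocked` is raised. Denied attempts are logged too, since a run of blocked actions is often the first visible sign of prompt injection or scope creep.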
Key insight: Multimodal agents carry one of the highest risk profiles of any AI application because they can take real-world actions. A text hallucination is annoying; an agent hallucination that clicks “Send” on the wrong email is a disaster. Always sandbox, and always require human approval for irreversible actions.
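The rollback pattern shows why irreversible actions need separate treatment: an action can only be undone if an inverse exists for it. Below is a minimal sketch of an undo journal, assuming each reversible action can supply a callback that restores the previous state. The `UndoJournal` class and its method names are hypothetical.

```python
from typing import Callable, List


class UndoJournal:
    """Keeps an undo callback for each reversible action the agent performs."""

    def __init__(self) -> None:
        self._undo_stack = []  # (description, undo callback) pairs, oldest first

    def record(self, description: str, undo: Callable[[], None]) -> None:
        """Register how to reverse an action that has just been performed."""
        self._undo_stack.append((description, undo))

    def rollback(self, steps: int = 1) -> List[str]:
        """Undo up to `steps` of the most recent actions, newest first."""
        undone = []
        for _ in range(min(steps, len(self._undo_stack))):
            description, undo = self._undo_stack.pop()
            undo()
            undone.append(description)
        return undone
```

A sent email or a completed purchase has no callback to record, so it never enters the journal. Those are the actions that have to be caught before execution, by human approval, rather than after.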