Key Insights — AI Security

Attack Surface

Injection, Jailbreaks & Poisoning

Chapters 1-5

expand_more

1

AI introduces entirely new attack vectors where inputs are natural language, outputs are probabilistic, and the blast radius is expanding.

OWASP Top 10 for LLMs: The definitive list of AI vulnerabilities, with Prompt Injection (LLM01) at the top.
MITRE ATLAS: The ATT&CK framework adapted for AI, mapping adversarial tactics like evasion, poisoning, and extraction.

2

Prompt Injection

The fundamental flaw of LLMs: they cannot reliably distinguish between developer instructions and user inputs.

Direct Injection: A user intentionally types malicious commands to override the system prompt or extract hidden instructions.
Indirect Injection: Malicious instructions are hidden in external data (like a webpage or email) that the LLM retrieves and processes.

3

Jailbreaking

Bypassing a model's safety alignment to elicit prohibited, harmful, or toxic outputs.

Techniques: Attackers use role-play (e.g., "DAN"), hypothetical scenarios, or encoded payloads (Base64) to trick the model into ignoring its safety training.
Many-Shot Jailbreaks: Overwhelming the model's context window with hundreds of fake "successful" malicious interactions to normalize bad behavior.

4

Data Poisoning & Supply Chain

Corrupting the model at its source by manipulating training data or compromising dependencies.

Sleeper Agents: Models trained to behave normally until a specific trigger word is present, at which point they execute malicious behavior.
Pickle Exploits: Malicious models hosted on platforms like Hugging Face that execute arbitrary code when downloaded and deserialized.

5

Adversarial Machine Learning

Mathematical attacks that exploit the high-dimensional geometry of neural networks.

Adversarial Examples: Adding invisible noise to an image (e.g., a stop sign) that causes a computer vision model to misclassify it with high confidence.

The Bottom Line: Because LLMs parse instructions and data in the same stream, perfect prevention of prompt injection is currently considered mathematically impossible. Defense requires a layered approach.

Defense

Guardrails, RAG & Agents

Chapters 6-9

expand_more

6

Guardrails & Filtering

The first line of defense: intercepting malicious inputs and sanitizing harmful outputs.

Input/Output Filtering: Using smaller, specialized models (like Llama Guard) to scan prompts for injections and responses for PII or toxicity.
Semantic Routing: Directing safe queries to the main LLM and blocking or redirecting unsafe queries before they cost compute.

7

Securing RAG Systems

Retrieval-Augmented Generation turns your internal documents into an attack surface.

RAG Poisoning: Attackers insert malicious instructions into documents (like resumes or public wikis) that they know the RAG system will ingest.
Access Control: Ensure the LLM only retrieves documents the current user actually has permission to view.

8

Securing Autonomous Agents

When AI can take actions (API calls, code execution), the blast radius of an attack expands exponentially.

Excessive Agency (LLM06): Giving an agent broad permissions (e.g., full database write access) instead of scoping tools to least privilege.
Human-in-the-Loop: Requiring explicit user approval before an agent executes destructive or high-stakes actions.

9

Securing MCP

The Model Context Protocol standardizes tool use, but standardizes the attack vectors along with it.

Tool Poisoning: Hiding malicious instructions inside the schema definitions of MCP tools.
Sandboxing: Running agent code execution environments in isolated containers (like Docker or WASM) to prevent system compromise.

The Bottom Line: Defense in depth is mandatory. You must assume the LLM will eventually be compromised by a prompt injection, and build sandboxes and guardrails to contain the damage.

Risk

Privacy, Red Teaming & Compliance

Chapters 10-12

expand_more

10

Data Privacy & Model Extraction

Models memorize their training data, making them vulnerable to privacy leaks.

Membership Inference: Attacks that determine if a specific person's data was used in the training set.
Model Extraction: Stealing a proprietary model by querying it millions of times and using the outputs to train a clone.

11

AI Red Teaming

Proactively attacking your own AI systems to find vulnerabilities before deployment.

Automated Red Teaming: Using specialized LLMs to generate thousands of adversarial prompts to stress-test your application (e.g., using tools like PromptFoo or Garak).
Continuous Evaluation: Red teaming is not a one-time audit; it must be integrated into the CI/CD pipeline as models and attacks evolve.

12

Governance, Risk & Compliance

Navigating the rapidly evolving legal and regulatory landscape for AI.

EU AI Act: The first comprehensive legal framework classifying AI systems by risk level, with heavy fines for non-compliance.
NIST AI RMF: The US framework for managing AI risk, focusing on mapping, measuring, and managing potential harms.

The Bottom Line: Security testing for AI must be automated and continuous. Relying on manual penetration testing is insufficient against the scale and creativity of LLM-based attacks.

Architecture

Hardening Production Systems

Chapters 13-14

expand_more

13

Secure AI Architecture

Designing systems that remain secure even when the core LLM is compromised.

Zero-Trust AI: Never trust the output of an LLM. Treat it as untrusted user input before passing it to databases or APIs.
API Gateways: Centralizing rate limiting, authentication, and guardrail execution to protect backend models from unbounded consumption (LLM10).

14

Hardening Production Systems

Implementing robust incident response and monitoring for AI applications.

Observability: Logging full prompt/response pairs, tool calls, and latency to detect anomalous behavior or slow-burn data poisoning.
Incident Response: Having a clear playbook for when an AI goes rogue, including "kill switches" to instantly degrade to safe fallbacks.

The Bottom Line: Treat the LLM as a potentially hostile actor inside your network. Isolate it, monitor it, and strictly limit what it can execute.