Ch 13 — Secure AI Architecture Patterns

Zero trust, LLM gateways, LLM firewalls, confidential computing, AI-SPM
High Level
dns
Gateway
arrow_forward
lock
Auth
arrow_forward
speed
Rate Limit
arrow_forward
smart_toy
Model Layer
arrow_forward
build
Tool Layer
arrow_forward
monitoring
Audit
-
Click play or press Space to begin the journey...
Step- / 7
shield
Zero Trust Architecture for AI Systems
Never trust, always verify — applied to LLMs and agents
Why Zero Trust for AI
Traditional perimeter security assumes everything inside the network is trusted. Zero Trust rejects implicit trust in any user, asset, or request. For AI systems, this is critical because:

Unpredictable outputs: LLMs generate dynamic requests that bypass traditional API validation
Machine speed: Agents operate without human oversight, making real-time verification essential
Multi-tenant risk: Shared model infrastructure means one compromised tenant can affect others
Tool chaining: Agents call external tools that expand the attack surface with every hop
Zero Trust Principles for LLMs
1. Continuous verification: Authenticate every request, not just the connection. Per-request token validation, not session-based auth.

2. Least privilege: Grant minimum required permissions per tool call. Scope tokens to specific actions (Ch 8, Ch 9).

3. Micro-segmentation: Isolate model, application, integration, and infrastructure layers. A compromised guardrail shouldn’t expose the model weights.

4. Assume breach: Design for the scenario where any single component is compromised. Layered defenses ensure no single failure is catastrophic.
CSA guidance: The Cloud Security Alliance published “Using Zero Trust to Secure Enterprise Information in LLM Environments” — the first industry framework specifically applying Zero Trust principles to LLM deployments.
dns
The LLM Gateway Pattern
A security proxy between your application and the model
What It Is
An LLM gateway sits between applications and LLM providers, enforcing security controls at a single chokepoint. It mirrors provider endpoints (e.g., OpenAI’s /v1/chat/completions), so applications redirect traffic by changing only the base URL. All requests and responses pass through the gateway’s security pipeline.
8-Stage Security Pipeline
1. Authentication: Per-client API keys with 256-bit entropy
2. Rate limiting: Token-based (not request-based) with sliding windows
3. Model allowlist: Restrict which models clients can access
4. Prompt injection detection: 20+ regex patterns with cumulative risk scoring
5. PII scanning: SSN, credit cards, emails, phones — redact or block
6. Response scanning: Same injection/PII checks on LLM output
7. Provider routing: Load balance across OpenAI, Bedrock, etc.
8. Audit logging: Structured JSON with latency, correlation IDs
# LLM Gateway: conceptual architecture # Application code — only change base URL client = OpenAI( base_url="https://llm-gateway.internal", api_key="client-specific-key" ) # Gateway pipeline (transparent to app): # ┌─────────────────────────────┐ # │ 1. Authenticate client key │ # │ 2. Check token rate limit │ # │ 3. Verify model allowlist │ # │ 4. Scan for prompt injection│ # │ 5. Scan/redact PII │ # │ 6. Forward to provider │ # │ 7. Scan response │ # │ 8. Log everything │ # └─────────────────────────────┘
Streaming challenge: Streaming responses complicate security — full response scanning for PII conflicts with real-time token delivery. Advanced gateways buffer selectively or use streaming-compatible scanners.
lock
Authentication & Token-Based Rate Limiting
OAuth2 scopes, token quotas, and abuse prevention
Authentication Vulnerabilities
LLM API auth faces specific risks:

Bearer token exposure: Keys leak via git history, browser network requests, mobile binaries, and client-side JavaScript
Horizontal escalation: Accessing other users’ conversations
Vertical escalation: Regular users accessing admin functions
Parameter scope bypass: Clients overriding system prompts via API parameters

Replace static API keys with short-lived OAuth2 tokens, role-based scopes, and attribute-based access control.
Token-Based Rate Limiting
Traditional request-based rate limiting is insufficient for LLMs — identical HTTP requests can vary dramatically in resource cost (a 10-token prompt vs. a 10,000-token prompt). Modern approaches limit tokens consumed, not requests sent.

Apache APISIX ai-rate-limiting plugin supports configurable token limits across sliding windows by token type (total, prompt, or completion tokens).

Enterprises need per-client limits, spike arrest, model failover, and fine-grained attribution — none of which provider defaults offer.
OWASP LLM10:2025 — Unbounded Consumption: Without token-based rate limiting, a single malicious or buggy client can exhaust your entire inference budget. Rate limit by tokens, not requests. Set per-client quotas. Alert on anomalous consumption patterns.
local_fire_department
LLM Firewalls: WAF for the AI Era
LlamaFirewall, Akamai, Cloudflare — real-time model protection
What LLM Firewalls Do
An LLM firewall is a specialized security layer that protects LLM applications from AI-specific threats. Unlike traditional WAFs, these address prompt injection, unsafe code generation, agent misalignment, and goal hijacking. They operate as real-time, production-ready layers supporting high-throughput environments.
LlamaFirewall (Meta, 2025)
Open-source guardrail framework with multiple specialized scanners:

PromptGuard 2: Lightweight classifier detecting direct prompt injection with high precision and low latency
AlignmentCheck: Chain-of-thought auditing that inspects agent reasoning for goal hijacking and indirect injection
CodeShield: Static analysis preventing insecure code generation across 8 programming languages
Custom scanners: Regex-based and LLM-prompt scanners for flexible threat detection
Commercial LLM Firewalls
Akamai Firewall for AI: Enterprise LLM security for hybrid environments. Protects against prompt injection, data exfiltration, and model abuse at the edge.

Cloudflare AI Security for Apps: Integrated into their WAF. Prompt injection detection, unsafe/custom topic detection, PII detection and prevention. Available on all Cloudflare plans.
Key differentiator from guardrails (Ch 6): LLM firewalls focus on system-level defenses for agentic operations — code generation, tool orchestration, autonomous decision-making. Guardrails focus on content safety. Use both: guardrails for content, firewalls for system security.
enhanced_encryption
Confidential Computing for AI Inference
TEE-GPU architectures — encrypting data during processing
The Problem
Traditional encryption protects data at rest (stored) and in transit (network). But during inference, data must be decrypted for the GPU to process it — creating a window where sensitive data is exposed in memory. Confidential computing closes this gap by encrypting data during processing using hardware-level Trusted Execution Environments (TEEs).
Current State (2025)
NVIDIA HGX B200 (Blackwell): Production-grade confidential computing with near-native performance. The historical 30–50% performance penalty has been eliminated.

NVIDIA H100: GPU TEE with 4–8% throughput penalty that diminishes with larger batch sizes.

CPU TEEs (Intel TDX/SGX): Under 10% throughput overhead and 20% latency overhead for full LLM inference pipelines.
Hybrid Architecture: SecureInfer
SecureInfer partitions LLM workloads strategically:

Security-sensitive components (attention projections, LoRA adapters) execute inside SGX enclaves
Compute-intensive operations (matrix multiplication) run on GPUs after encryption

This balances security with performance for models like LLaMA-2. Source: arxiv.org/abs/2510.19979
Who needs this: Finance, healthcare, and government sectors under GDPR, HIPAA, and financial regulations. If your compliance requires data protection during processing — not just storage and transit — confidential computing is the answer. Gartner projects AI infrastructure spending reaching $400B by 2027, with security as the top adoption barrier.
visibility
AI Security Posture Management (AI-SPM)
Wiz, Orca, Palo Alto, Microsoft Defender — visibility into AI deployments
What AI-SPM Does
AI-SPM is a cloud security category that addresses the rapid, often uncontrolled expansion of AI deployments. Three core capabilities:

1. Discovery & visibility: Find all AI applications, models, and infrastructure — including shadow AI that bypasses security review
2. Risk identification: Misconfiguration detection, vulnerability scanning, attack path analysis across AI pipelines
3. Data protection: Monitor sensitive data in training datasets, detect data leakage paths
Leading Solutions (2025)
Wiz: AI-BOM discovery, pipeline misconfiguration detection, attack path analysis extended to AI services

Orca Security: Agentless visibility covering 50+ AI models and packages, DSPM for AI training data

Palo Alto (Cortex Cloud): AI ecosystem visibility, model risk analysis, data classification across training pipelines

Microsoft Defender for Cloud: Multicloud AI workload discovery (Azure, AWS, GCP) with AI-BOM generation
Why AI-SPM matters: You can’t secure what you can’t see. OWASP’s GenAI Security Solutions Reference Guide (Q2–Q3 2025) identifies AI-SPM as an emerging technology category alongside LLM Firewalls and Guardrails. It’s the governance layer (Ch 12) made operational.
layers
Defense in Depth: The Complete Stack
OWASP GenAI Security — layered architecture for production AI
The Layered Architecture
Layer 1 — Edge (Gateway): LLM gateway with auth, rate limiting, model allowlists. Single security chokepoint for all AI traffic.

Layer 2 — Input (Firewall): LLM firewall (LlamaFirewall, Cloudflare) for prompt injection detection, PII scanning, input validation.

Layer 3 — Model (Isolation): Confidential computing (TEE-GPU), micro-segmentation, context window hygiene, memory isolation.

Layer 4 — Tools (Sandbox): WASM/container sandboxing (Ch 8), least privilege, human-in-the-loop for high-stakes actions.

Layer 5 — Output (Guardrails): Response scanning, output filtering (Ch 6), canary token detection.

Layer 6 — Observability (AI-SPM): Shadow AI discovery, anomaly detection, audit logging, compliance monitoring.
OWASP LLMSecOps Lifecycle
OWASP’s GenAI Security Solutions Reference Guide (2025) defines a structured LLMSecOps lifecycle across four phases:

Planning: Threat modeling (MITRE ATLAS), risk assessment (NIST AI RMF)
Data handling: PII scanning, data governance, AI-BOM
Deployment: Gateway, firewall, sandboxing, confidential computing
Monitoring: AI-SPM, red teaming (Ch 11), incident response
The architecture principle: No single layer is sufficient. Prompt injection bypasses guardrails? The firewall catches it. Firewall fails? The sandbox limits damage. Sandbox escapes? AI-SPM detects the anomaly. Each layer compensates for the others’ failures. This is defense in depth for the AI era.