Ch 13: Secure AI Architecture Patterns

Ch 13 — Secure AI Architecture Patterns

Zero trust, LLM gateways, LLM firewalls, confidential computing, AI-SPM

Index Under the Hood →

High Level

dns

Gateway

arrow_forward

lock

Auth

arrow_forward

speed

Rate Limit

arrow_forward

smart_toy

Model Layer

arrow_forward

build

Tool Layer

arrow_forward

monitoring

Audit

Click play or press Space to begin the journey...

Step- / 7

shield

Zero Trust Architecture for AI Systems

Never trust, always verify — applied to LLMs and agents

Why Zero Trust for AI

Traditional perimeter security assumes everything inside the network is trusted. Zero Trust rejects implicit trust in any user, asset, or request. For AI systems, this is critical because:

Unpredictable outputs: LLMs generate dynamic requests that bypass traditional API validation
Machine speed: Agents operate without human oversight, making real-time verification essential
Multi-tenant risk: Shared model infrastructure means one compromised tenant can affect others
Tool chaining: Agents call external tools that expand the attack surface with every hop

Zero Trust Principles for LLMs

1. Continuous verification: Authenticate every request, not just the connection. Per-request token validation, not session-based auth.

2. Least privilege: Grant minimum required permissions per tool call. Scope tokens to specific actions (Ch 8, Ch 9).

3. Micro-segmentation: Isolate model, application, integration, and infrastructure layers. A compromised guardrail shouldn’t expose the model weights.

4. Assume breach: Design for the scenario where any single component is compromised. Layered defenses ensure no single failure is catastrophic.

CSA guidance: The Cloud Security Alliance published “Using Zero Trust to Secure Enterprise Information in LLM Environments” — the first industry framework specifically applying Zero Trust principles to LLM deployments.

dns

The LLM Gateway Pattern

A security proxy between your application and the model

What It Is

An LLM gateway sits between applications and LLM providers, enforcing security controls at a single chokepoint. It mirrors provider endpoints (e.g., OpenAI’s /v1/chat/completions), so applications redirect traffic by changing only the base URL. All requests and responses pass through the gateway’s security pipeline.

8-Stage Security Pipeline

1. Authentication: Per-client API keys with 256-bit entropy
2. Rate limiting: Token-based (not request-based) with sliding windows
3. Model allowlist: Restrict which models clients can access
4. Prompt injection detection: 20+ regex patterns with cumulative risk scoring
5. PII scanning: SSN, credit cards, emails, phones — redact or block
6. Response scanning: Same injection/PII checks on LLM output
7. Provider routing: Load balance across OpenAI, Bedrock, etc.
8. Audit logging: Structured JSON with latency, correlation IDs

# LLM Gateway: conceptual architecture # Application code — only change base URL client = OpenAI( base_url="https://llm-gateway.internal", api_key="client-specific-key" ) # Gateway pipeline (transparent to app): # ┌─────────────────────────────┐ # │ 1. Authenticate client key │ # │ 2. Check token rate limit │ # │ 3. Verify model allowlist │ # │ 4. Scan for prompt injection│ # │ 5. Scan/redact PII │ # │ 6. Forward to provider │ # │ 7. Scan response │ # │ 8. Log everything │ # └─────────────────────────────┘

Streaming challenge: Streaming responses complicate security — full response scanning for PII conflicts with real-time token delivery. Advanced gateways buffer selectively or use streaming-compatible scanners.

lock

Authentication & Token-Based Rate Limiting

OAuth2 scopes, token quotas, and abuse prevention

Authentication Vulnerabilities

LLM API auth faces specific risks:

Bearer token exposure: Keys leak via git history, browser network requests, mobile binaries, and client-side JavaScript
Horizontal escalation: Accessing other users’ conversations
Vertical escalation: Regular users accessing admin functions
Parameter scope bypass: Clients overriding system prompts via API parameters

Replace static API keys with short-lived OAuth2 tokens, role-based scopes, and attribute-based access control.

Token-Based Rate Limiting

Traditional request-based rate limiting is insufficient for LLMs — identical HTTP requests can vary dramatically in resource cost (a 10-token prompt vs. a 10,000-token prompt). Modern approaches limit tokens consumed, not requests sent.

Apache APISIX ai-rate-limiting plugin supports configurable token limits across sliding windows by token type (total, prompt, or completion tokens).

Enterprises need per-client limits, spike arrest, model failover, and fine-grained attribution — none of which provider defaults offer.

OWASP LLM10:2025 — Unbounded Consumption: Without token-based rate limiting, a single malicious or buggy client can exhaust your entire inference budget. Rate limit by tokens, not requests. Set per-client quotas. Alert on anomalous consumption patterns.

local_fire_department

LLM Firewalls: WAF for the AI Era

LlamaFirewall, Akamai, Cloudflare — real-time model protection

What LLM Firewalls Do

An LLM firewall is a specialized security layer that protects LLM applications from AI-specific threats. Unlike traditional WAFs, these address prompt injection, unsafe code generation, agent misalignment, and goal hijacking. They operate as real-time, production-ready layers supporting high-throughput environments.

LlamaFirewall (Meta, 2025)

Open-source guardrail framework with multiple specialized scanners:

PromptGuard 2: Lightweight classifier detecting direct prompt injection with high precision and low latency
AlignmentCheck: Chain-of-thought auditing that inspects agent reasoning for goal hijacking and indirect injection
CodeShield: Static analysis preventing insecure code generation across 8 programming languages
Custom scanners: Regex-based and LLM-prompt scanners for flexible threat detection

Commercial LLM Firewalls

Akamai Firewall for AI: Enterprise LLM security for hybrid environments. Protects against prompt injection, data exfiltration, and model abuse at the edge.

Cloudflare AI Security for Apps: Integrated into their WAF. Prompt injection detection, unsafe/custom topic detection, PII detection and prevention. Available on all Cloudflare plans.

Key differentiator from guardrails (Ch 6): LLM firewalls focus on system-level defenses for agentic operations — code generation, tool orchestration, autonomous decision-making. Guardrails focus on content safety. Use both: guardrails for content, firewalls for system security.

enhanced_encryption

Confidential Computing for AI Inference

TEE-GPU architectures — encrypting data during processing

The Problem

Traditional encryption protects data at rest (stored) and in transit (network). But during inference, data must be decrypted for the GPU to process it — creating a window where sensitive data is exposed in memory. Confidential computing closes this gap by encrypting data during processing using hardware-level Trusted Execution Environments (TEEs).

Current State (2025)

NVIDIA HGX B200 (Blackwell): Production-grade confidential computing with near-native performance. The historical 30–50% performance penalty has been eliminated.

NVIDIA H100: GPU TEE with 4–8% throughput penalty that diminishes with larger batch sizes.

CPU TEEs (Intel TDX/SGX): Under 10% throughput overhead and 20% latency overhead for full LLM inference pipelines.

Hybrid Architecture: SecureInfer

SecureInfer partitions LLM workloads strategically:

• Security-sensitive components (attention projections, LoRA adapters) execute inside SGX enclaves
• Compute-intensive operations (matrix multiplication) run on GPUs after encryption

This balances security with performance for models like LLaMA-2. Source: arxiv.org/abs/2510.19979

Who needs this: Finance, healthcare, and government sectors under GDPR, HIPAA, and financial regulations. If your compliance requires data protection during processing — not just storage and transit — confidential computing is the answer. Gartner projects AI infrastructure spending reaching $400B by 2027, with security as the top adoption barrier.

visibility

AI Security Posture Management (AI-SPM)

Wiz, Orca, Palo Alto, Microsoft Defender — visibility into AI deployments

What AI-SPM Does

AI-SPM is a cloud security category that addresses the rapid, often uncontrolled expansion of AI deployments. Three core capabilities:

1. Discovery & visibility: Find all AI applications, models, and infrastructure — including shadow AI that bypasses security review
2. Risk identification: Misconfiguration detection, vulnerability scanning, attack path analysis across AI pipelines
3. Data protection: Monitor sensitive data in training datasets, detect data leakage paths

Leading Solutions (2025)

Wiz: AI-BOM discovery, pipeline misconfiguration detection, attack path analysis extended to AI services

Orca Security: Agentless visibility covering 50+ AI models and packages, DSPM for AI training data

Palo Alto (Cortex Cloud): AI ecosystem visibility, model risk analysis, data classification across training pipelines

Microsoft Defender for Cloud: Multicloud AI workload discovery (Azure, AWS, GCP) with AI-BOM generation

Why AI-SPM matters: You can’t secure what you can’t see. OWASP’s GenAI Security Solutions Reference Guide (Q2–Q3 2025) identifies AI-SPM as an emerging technology category alongside LLM Firewalls and Guardrails. It’s the governance layer (Ch 12) made operational.

layers

Defense in Depth: The Complete Stack

OWASP GenAI Security — layered architecture for production AI

The Layered Architecture

Layer 1 — Edge (Gateway): LLM gateway with auth, rate limiting, model allowlists. Single security chokepoint for all AI traffic.

Layer 2 — Input (Firewall): LLM firewall (LlamaFirewall, Cloudflare) for prompt injection detection, PII scanning, input validation.

Layer 3 — Model (Isolation): Confidential computing (TEE-GPU), micro-segmentation, context window hygiene, memory isolation.

Layer 4 — Tools (Sandbox): WASM/container sandboxing (Ch 8), least privilege, human-in-the-loop for high-stakes actions.

Layer 5 — Output (Guardrails): Response scanning, output filtering (Ch 6), canary token detection.

Layer 6 — Observability (AI-SPM): Shadow AI discovery, anomaly detection, audit logging, compliance monitoring.

OWASP LLMSecOps Lifecycle

OWASP’s GenAI Security Solutions Reference Guide (2025) defines a structured LLMSecOps lifecycle across four phases:

Planning: Threat modeling (MITRE ATLAS), risk assessment (NIST AI RMF)
Data handling: PII scanning, data governance, AI-BOM
Deployment: Gateway, firewall, sandboxing, confidential computing
Monitoring: AI-SPM, red teaming (Ch 11), incident response

The architecture principle: No single layer is sufficient. Prompt injection bypasses guardrails? The firewall catches it. Firewall fails? The sandbox limits damage. Sandbox escapes? AI-SPM detects the anomaly. Each layer compensates for the others’ failures. This is defense in depth for the AI era.