Ch 11: Red Teaming AI Systems

Ch 11 — Red Teaming AI Systems

Garak, PyRIT, DeepTeam — NIST AI RMF, DEF CON AI Village, automated red teaming

Index Under the Hood →

High Level

target

Scope

arrow_forward

build

Tool Setup

arrow_forward

auto_fix_high

Generate

arrow_forward

play_arrow

Execute

arrow_forward

analytics

Measure

arrow_forward

summarize

Report

Click play or press Space to begin the journey...

Step- / 7

target

Why AI Red Teaming Is Different

Pen testing finds known bugs — red teaming discovers unknown failures

Pen Testing vs. Red Teaming

AI penetration testing is systematic and repeatable: test every known attack class from OWASP LLM Top 10, produce severity-rated findings, deliver remediation guidance. It asks: “Does this specific vulnerability exist?”

AI red teaming is open-ended and exploratory: follow unexpected paths, discover novel failure modes, test behavioral issues and safety policy violations. It asks: “How can I make this system fail in ways that matter?”

Both are necessary. Pen testing provides coverage; red teaming provides depth.

What Makes AI Different

AI red teaming differs from traditional software security testing in fundamental ways:

Probabilistic outputs: The same input can produce different outputs. You can’t write deterministic test cases.
Dual risk surface: Must probe both security risks (injection, exfiltration) AND responsible AI risks (bias, toxicity, misinformation) simultaneously.
Natural language attack surface: Attacks are crafted in plain English, not exploit code. Creative writing skills matter as much as technical skills.
Moving target: Model updates, fine-tuning, and prompt changes can invalidate previous test results overnight.

Microsoft’s framing: “Red teaming generative AI differs from traditional software testing because generative AI systems are more probabilistic than traditional systems.” This is why they built PyRIT — manual testing doesn’t scale for probabilistic systems.

account_balance

NIST AI RMF & the ARIA Program

Government frameworks for structured AI red teaming

NIST AI Risk Management Framework

The NIST AI RMF (Jan 2023) provides the foundational structure for AI red teaming. Its “Measure” function defines four categories:

Measure 1: Define testing metrics for AI risk evaluation
Measure 2: Evaluate for trustworthiness, safety, security, fairness, and misuse
Measure 3: Track and manage emerging risks
Measure 4: Correlate AI risk with business impact

Additional guidance in NIST 600-1 (Generative AI Profile) and NIST AI 800-1 (Dual-Use Foundation Models, 2nd draft Jan 2025).

ARIA: Three-Level Testbed

NIST’s ARIA (Assessing Risks and Impacts of AI) program operationalizes the framework with three testing levels:

Level 1 — Model testing: Confirm claimed capabilities
Level 2 — Red teaming: Stress test and attempt to induce risks
Level 3 — Field testing: Examine impacts under regular use

The first national ARIA Red-Teaming Challenge (Fall 2024) had 500+ participants conducting 139 successful attacks against target AI systems using social engineering and technical exploits.

Key finding: The ARIA challenge found that NIST 600-1 is “readily operationalizable in several — but not all — key areas.” Some risk categories still need clarification for field validation. The framework is evolving alongside the threat landscape.

groups

DEF CON 31: The Largest Public AI Red Team

2,500 hackers, 8 LLMs, White House-backed — August 2023

The Event

At DEF CON 31 (August 2023), the AI Village hosted the first-ever public Generative AI Red Team Challenge, backed by the White House Office of Science and Technology Policy. Nearly 2,500 hackers tested 8 LLMs from Anthropic, Google, Hugging Face, NVIDIA, OpenAI, Stability, and Microsoft using 156 closed-network terminals at Caesars Forum, Las Vegas.

What They Found

• Diverse vulnerabilities: jailbreaks, hallucinations, prompt injection, bias, discrimination, and privacy leaks
• Some models were updated overnight based on Day 1 findings — participants noticed improved defenses on Day 2
• Participants found challenges harder than anticipated, suggesting models had meaningful safeguards
• External red-teaming proved effective at identifying novel risks at scale that internal testing missed

Why It Matters

DEF CON 31 demonstrated three critical principles:

1. Public red teaming works: Crowdsourced testing discovers failure modes that internal teams miss. Diverse backgrounds (security researchers, linguists, ethicists) produce diverse attack strategies.

2. Continuous testing is essential: Models updated overnight changed the attack surface. A one-time audit is insufficient.

3. Red teaming is a community effort: No single organization has the expertise to test all failure modes. The White House endorsement signaled that external red teaming should become standard practice.

Impact: The DEF CON event directly influenced the Biden-Harris AI governance framework and established norms for continuous external assessment of LLMs. It proved that adversarial testing at scale is both feasible and necessary.

radar

Garak: NVIDIA’s LLM Vulnerability Scanner

The nmap of LLMs — probes, generators, and detectors

What Garak Does

Garak (Generative AI Red-teaming & Assessment Kit) is NVIDIA’s open-source LLM vulnerability scanner. Think of it as nmap or Metasploit, but for LLMs. It combines static, dynamic, and adaptive probes to find hallucination, data leakage, prompt injection, misinformation, toxicity, and jailbreaks.

Architecture

Probes: Manage attack strategies (prompt injection, encoding attacks, social engineering, etc.)
Generators: Abstract away target models (supports Hugging Face, OpenAI, AWS Bedrock, Replicate, LiteLLM, GGUF/llama.cpp, REST APIs)
Detectors: Assess whether attacks succeeded using pattern matching, semantic analysis, and LLM-based evaluation

Results compile into HTML reports and JSON summaries. Originally developed by Prof. Leon Derczynski (Spring 2023), now maintained by NVIDIA under Apache 2.0.

# Garak: scan an LLM for vulnerabilities # Install $ pip install garak # Scan OpenAI model for all known vulns $ garak --model_type openai \ --model_name gpt-4 \ --probes all # Scan for specific vulnerability class $ garak --model_type openai \ --model_name gpt-4 \ --probes promptinject,encoding # Output: HTML report + JSON results # Shows: probe name, attempts, passes, # failures, success rate per attack type # Also supports local models: $ garak --model_type huggingface \ --model_name meta-llama/Llama-3-8B

Strength: Garak is coverage-oriented — it systematically tests every known attack class. Use it for pen-testing-style sweeps. For creative, exploratory red teaming, combine with human testers or multi-turn tools like PyRIT.

psychology

PyRIT: Microsoft’s Multi-Turn Red Teaming

Automated Crescendo, TAP, and Skeleton Key attacks

Why PyRIT Exists

Microsoft released PyRIT (Python Risk Identification Tool) in February 2024 because single-turn probes (like Garak) miss multi-turn attacks where the adversary gradually escalates across a conversation. PyRIT automates multi-turn attack strategies that would take human red teamers hours to execute manually.

Attack Strategies

Crescendo: Gradually escalate from benign to harmful requests across multiple turns
TAP (Tree of Attacks with Pruning): Explore branching attack paths, pruning unsuccessful branches
Skeleton Key: Convince the model to adopt a permissive persona that bypasses safety training

Supports OpenAI, Azure, Anthropic, Google, HuggingFace, custom HTTP endpoints, and web apps. Built-in memory (SQLite/Azure SQL) tracks all conversations and results. v0.11.0 (Feb 2026), ~99K monthly PyPI downloads, 3,556 GitHub stars.

# PyRIT: multi-turn Crescendo attack from pyrit.orchestrator import ( CrescendoOrchestrator ) from pyrit.prompt_target import ( OpenAIChatTarget ) target = OpenAIChatTarget( model_name="gpt-4" ) orchestrator = CrescendoOrchestrator( objective="Extract system prompt", prompt_target=target, max_turns=10 ) # Runs automated multi-turn conversation # Gradually escalates toward objective # Scores each response for success result = await orchestrator.run() # result.success, result.conversation

CoPyRIT GUI: PyRIT also includes a graphical interface for human-led red teaming with collaboration features. Useful for teams where not everyone writes Python. Combines automated attack generation with human judgment.

bug_report

DeepTeam: Framework-Aligned Vulnerability Scanning

Confident AI — 40+ vulnerabilities, OWASP/NIST/MITRE aligned

What DeepTeam Does

DeepTeam (Confident AI) is an open-source red teaming framework that detects 40+ vulnerability types out-of-the-box: bias (gender, race, political, religious), PII leakage, toxicity, misinformation, factual errors, and robustness issues. It supports 10+ adversarial attack methods including prompt injection, jailbreaking, leetspeak, ROT-13, and multi-turn attacks (linear and tree jailbreaking).

Framework Alignment

DeepTeam maps directly to industry standards:

• OWASP Top 10 for LLMs 2025
• OWASP Top 10 for Agents 2026
• NIST AI Risk Management Framework
• MITRE ATLAS

This means your red team results map directly to compliance requirements. v1.0.4 (Nov 2025), Apache 2.0 license.

# DeepTeam: scan for OWASP LLM Top 10 from deepteam import red_team from deepteam.vulnerabilities import ( PromptInjection, Bias, PIILeakage ) from deepteam.attacks import ( Jailbreaking, Leetspeak, ROT13 ) # Define your model callback def model_callback(prompt): return my_llm.generate(prompt) # Run red team scan results = red_team( model_callback=model_callback, vulnerabilities=[ PromptInjection(), Bias(), PIILeakage() ], attacks=[ Jailbreaking(), Leetspeak(), ROT13() ] ) # Binary pass/fail with reasoning

Choosing the right tool: Garak for broad vulnerability scanning (nmap-style). PyRIT for multi-turn, creative attack strategies. DeepTeam for compliance-aligned testing against OWASP/NIST/MITRE. Use all three for comprehensive coverage.

checklist

Building a Red Team Program

From one-off tests to continuous adversarial evaluation

The Red Team Pipeline

1. Scope: Define what you’re testing (model, RAG pipeline, agent, MCP integrations) and what failure looks like (safety violations, data leakage, bias, jailbreaks)

2. Tool setup: Configure Garak for coverage scans, PyRIT for multi-turn attacks, DeepTeam for compliance mapping

3. Generate: Create attack payloads — automated (tool-generated) + manual (human-crafted for novel attacks)

4. Execute: Run attacks against the target system. Log every interaction.

5. Measure: Score results using LLM-based scorers, pattern matching, and human review

6. Report: Map findings to OWASP/NIST categories. Prioritize by severity and exploitability.

Three Delivery Models

1. Automated SaaS: CI/CD-integrated scanning that runs on every deployment. Catches regressions. Low cost, high coverage, limited creativity.

2. Open-source frameworks: PyRIT, Garak, DeepTeam. Full control, customizable, requires in-house expertise.

3. Bespoke consultancy: External red team with AI security expertise. Highest creativity, discovers novel failures. Expensive, periodic.

The bottom line: AI red teaming is not optional. The NIST AI RMF, EU AI Act, and Executive Order 14110 all require adversarial testing. Start with automated tools (Garak + DeepTeam) for coverage, add PyRIT for multi-turn depth, and bring in external red teamers for high-stakes deployments. Make it continuous, not one-off.