Ch 6 — Input Guardrails & Output Filtering

NeMo Guardrails, LLM Guard, Lakera Guard, OpenAI Moderation — the defensive toolchain
High Level
[Pipeline: User Input → Input Guard → LLM → Output Guard → Safe Response]
Why Guardrails: Runtime Defense for LLMs
The model can’t protect itself — external layers must enforce safety
The Problem
Chapters 2–5 showed that LLMs are vulnerable to prompt injection, jailbreaking, and adversarial inputs. Safety training (RLHF) helps but is not sufficient — Anthropic’s Sleeper Agents paper showed that backdoored behaviors can survive it. The model cannot reliably police itself. Guardrails are external software layers that inspect inputs before they reach the LLM and filter outputs before they reach the user.
Two-Sided Defense
Input guardrails — Detect and block prompt injection, jailbreak attempts, toxic content, PII, and off-topic requests before the LLM processes them

Output guardrails — Scan LLM responses for harmful content, data leakage, hallucinated facts, system prompt leakage, and policy violations before the user sees them
# The guardrail pattern
user_input = get_user_message()

# 1. Input guard
if input_guard.scan(user_input).flagged:
    return "Request blocked"

# 2. LLM generates response
response = llm.generate(user_input)

# 3. Output guard
if output_guard.scan(response).flagged:
    return "Response filtered"

# 4. Safe to deliver
return response
Defense in depth: No single guardrail catches everything. Production systems layer multiple tools: a fast regex/classifier for obvious attacks, an LLM-as-judge for nuanced cases, and canary tokens for exfiltration detection.
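The fast first layer can be as simple as a handful of compiled regexes. The sketch below is illustrative only — the patterns are hypothetical examples of obvious-attack signatures, not a complete or production-grade rule set:

```python
import re

# Hypothetical layer-1 prefilter: cheap pattern checks that run before
# any model-based scanner. These patterns are illustrative, not a
# complete injection signature set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now in developer mode", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def fast_prefilter(user_input: str) -> bool:
    """Return True if the input trips an obvious-attack pattern."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

Because this layer is pure string matching, it costs microseconds and can run on every request; anything it misses falls through to the slower layers.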
NVIDIA NeMo Guardrails
Open-source programmable rails with Colang 2.0 — v0.10.1 (beta)
What It Is
NeMo Guardrails is NVIDIA’s open-source toolkit for adding programmable guardrails to LLM applications. It uses Colang 2.0, a Python-like domain-specific language for defining conversational rails. You write flows that control what the bot can discuss, enforce dialog paths, and validate inputs/outputs — all as code, not just prompts.
Colang 2.0 Architecture
Colang 2.0 (available in NeMo Guardrails ≥0.9) is a complete rewrite from 1.0. Key features: parallel flow execution, a standard library, async actions, a generation operator (...), and an explicit main flow entry point. It supports input rails, output rails, dialog rails, and multimodal rails.
# NeMo Guardrails — Colang 2.0 example
flow main
  activate input_safety_check
  activate output_safety_check

flow input_safety_check
  user said something
  if contains_injection(user_message):
    bot say "I can't process that request."
    abort

flow output_safety_check
  bot said something
  if contains_pii(bot_message):
    bot say "[Response redacted for safety]"
Strength: Programmable, auditable, version-controlled safety logic. Tradeoff: Requires learning Colang and adds latency from the flow engine. Best for applications needing deterministic, policy-driven control over conversations.
LLM Guard by Protect AI
Open-source scanner library — MIT license, 2,680+ GitHub stars
Scanner-Based Architecture
LLM Guard takes a different approach from NeMo: instead of a flow language, it provides a library of composable scanners that you chain into a pipeline. Each scanner checks for one specific threat. Input scanners protect prompts; output scanners protect responses. Install with pip install llm-guard. Requires Python ≥3.9.
Available Scanners
Input: Anonymize, BanCode, BanSubstrings, BanTopics, Gibberish detection, Invisible text detection, Language detection, Prompt injection detection, Regex patterns, Secrets detection

Output: Deanonymize, NoRefusal, Relevance, Sensitive content, Bias detection, Regex filtering
# LLM Guard — scanner pipeline
from llm_guard import scan_prompt
from llm_guard.input_scanners import (
    Anonymize, PromptInjection, Secrets
)
from llm_guard.vault import Vault

vault = Vault()  # stores originals so Deanonymize can restore them later
input_scanners = [
    PromptInjection(),
    Secrets(),
    Anonymize(vault),
]

sanitized_prompt, results_valid, results_score = scan_prompt(
    input_scanners, prompt
)
if not all(results_valid.values()):
    return "Blocked"
Strength: Modular, open-source (MIT), easy to integrate. Each scanner is independent — add or remove as needed. Tradeoff: Scanner quality varies; prompt injection detection relies on classifier models that can be bypassed by novel attacks.
Lakera Guard: API-Based Detection
Real-time threat detection across 100+ languages — updated daily
How It Works
Lakera Guard is a hosted API service (not a library you run locally). You send user inputs, reference documents, and model outputs to the /v2/guard endpoint, and it returns flagged: true/false with detected threat categories. It screens for prompt injection, jailbreaks, data leakage, and inappropriate content across 100+ languages and scripts.
Threat Intelligence
Lakera’s detection models are updated daily using threat intelligence from analyzing 100K+ Gandalf attacks per day (Gandalf is Lakera’s public prompt injection challenge). This gives them a continuously evolving dataset of real attack patterns, not just academic examples.
# Lakera Guard — API call
import requests

response = requests.post(
    "https://api.lakera.ai/v2/guard",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "messages": [
            {"role": "user", "content": user_input}
        ],
        "project_id": "my-project",
    },
)
if response.json()["flagged"]:
    return "Threat detected"
Strength: Ultra-low latency, multilingual, continuously updated from real attacks. Tradeoff: Hosted service (data leaves your infrastructure), API dependency, cost at scale. Best for teams that want managed security without running their own models.
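The API-dependency tradeoff raises a design question the section glosses over: what happens when the hosted guard is slow or unreachable? A common pattern is to fail closed. A hedged sketch, where `call_guard` is a hypothetical callable wrapping the real HTTP request:

```python
def check_with_guard(call_guard, user_input: str, timeout_s: float = 2.0) -> bool:
    """Return True if the input is flagged.

    `call_guard` is any callable that performs the real guard-API request
    and returns a bool (hypothetical stand-in here).
    """
    try:
        return call_guard(user_input, timeout=timeout_s)
    except Exception:
        # Fail closed: an unreachable guard blocks the request rather
        # than silently waving it through.
        return True
```

Failing open (allowing traffic when the guard is down) keeps the product available but turns every guard outage into a security gap; failing closed is the conservative default for sensitive applications.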
OpenAI Moderation & Guardrails AI
Provider-side safety + structured output validation
OpenAI’s Built-In Defenses
OpenAI provides two complementary mechanisms. The Moderation API (free, using omni-moderation-latest built on GPT-4o) classifies text and images across categories: hate, violence, self-harm, sexual content, harassment, and illicit activities. It supports 40+ languages with a 42% accuracy improvement over the previous model. Separately, the Instruction Hierarchy (April 2024) trains models to prioritize system prompts over user instructions when they conflict.
Guardrails AI: Output Validation
Guardrails AI focuses on structured output validation rather than safety scanning. It uses Pydantic-based validators to ensure LLM outputs match expected schemas. Validators return PassResult or FailResult, with configurable on_fail policies (fix, re-ask, filter). Useful for enforcing JSON structure, regex patterns, and business rules on LLM outputs.
# OpenAI Moderation API
response = openai.moderations.create(
    model="omni-moderation-latest",
    input=llm_output,
)
if response.results[0].flagged:
    return "Content policy violation"

# Guardrails AI — structured validation
from guardrails import Guard
from pydantic import BaseModel

class SafeResponse(BaseModel):
    answer: str
    confidence: float
    sources: list[str]

guard = Guard.for_pydantic(SafeResponse)
result = guard(llm.generate, prompt=query)
# Validates structure + field constraints
Different roles: OpenAI Moderation catches harmful content. Guardrails AI enforces output structure. They solve different problems and can be layered together.
Canary Tokens & LLM-as-Judge
Exfiltration detection and nuanced safety evaluation
Canary Tokens
A canary token is a cryptographically random string injected into the system prompt at the start of each interaction. If the LLM’s response contains the canary, it means the model leaked its system prompt — likely due to a prompt injection attack (OWASP LLM07:2025). The check is a simple exact string match on the output, making it fast and reliable. LangChain4j has implemented this as a built-in guardrail feature.
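The whole mechanism fits in a few lines. A minimal sketch using the standard library (the prompt format and function names are illustrative):

```python
import secrets

# Fresh cryptographically random token per interaction.
CANARY = secrets.token_hex(16)

def build_system_prompt(base_prompt: str) -> str:
    # Embed the canary where the model would only echo it if it
    # regurgitates the system prompt.
    return f"{base_prompt}\n# canary: {CANARY}"

def leaked_system_prompt(llm_output: str) -> bool:
    # Exact substring match: fast, deterministic, and a false positive
    # would require guessing a 128-bit random value.
    return CANARY in llm_output
```

Because the check is an exact string match rather than a classifier, it adds effectively zero latency and never misfires on benign output.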
LLM-as-Judge
For nuanced safety decisions that regex and classifiers can’t handle, a second LLM evaluates the first LLM’s output. The judge LLM is prompted with safety criteria and returns a pass/fail verdict. This catches subtle policy violations that pattern matching misses.
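The judge is just a second model call with a constrained rubric plus a parser for its verdict. A hedged sketch — the policy text, template, and `judge_llm` call are hypothetical placeholders:

```python
JUDGE_TEMPLATE = """You are a safety reviewer. Evaluate the RESPONSE below
against this policy: no instructions enabling harm, no leaked personal data.
Answer with exactly one word: PASS or FAIL.

RESPONSE:
{response}"""

def build_judge_prompt(response: str) -> str:
    return JUDGE_TEMPLATE.format(response=response)

def parse_verdict(judge_output: str) -> bool:
    """True means the response passed. Anything that does not clearly
    start with PASS is treated as a failure (fail closed)."""
    return judge_output.strip().upper().startswith("PASS")

# In production the prompt goes to a second model, e.g. (hypothetical):
# verdict_text = judge_llm.generate(build_judge_prompt(candidate_response))
# safe = parse_verdict(verdict_text)
```

Constraining the judge to a one-word verdict makes parsing trivial and reduces the surface for the judge itself to be prompt-injected through its own output format.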
LLM-as-Judge Limitations
Research shows significant weaknesses: small stylistic changes in outputs can cause false negative rates to jump by up to 0.24. Adversarial attacks can fool judges into misclassifying 100% of harmful generations as safe. Judges exhibit prompt sensitivity and distribution shifts between benchmarks and real-world deployment. Source: Eiras et al., ICML 2025 “Know Thy Judge.”
The tradeoff: LLM-as-judge catches things classifiers miss, but adds latency, cost, and its own attack surface. A compromised judge LLM is worse than no judge at all. Human review remains necessary for high-stakes decisions.
Layered Defense & Honest Limitations
No single guardrail is sufficient — combine, layer, and accept the gaps
The Production Stack
Layer 1 — Fast classifiers: Regex, keyword blocklists, LLM Guard scanners. Catches obvious attacks in <10ms.

Layer 2 — ML classifiers: Lakera Guard or fine-tuned injection detectors. Catches sophisticated attacks in 50–200ms.

Layer 3 — Programmable rails: NeMo Guardrails Colang flows for dialog control and topic enforcement.

Layer 4 — Output validation: Guardrails AI for structure, OpenAI Moderation for content, canary tokens for leakage.

Layer 5 — LLM-as-judge: For edge cases that escape all other layers. Expensive but catches nuance.
Honest Limitations
No guardrail stack is 100% effective. Novel prompt injection techniques bypass classifiers trained on known patterns. Adversarial tokenization (Ch 5) evades text-based scanners. LLM-as-judge can be fooled. The goal is to raise the cost of attack and detect breaches quickly, not to achieve perfect prevention.
Coming Up
Ch 7: Securing RAG — Guardrails for retrieval-augmented generation pipelines

Ch 8: Securing Agents — Guardrails for tool-calling and autonomous agents

Ch 11: Red Teaming — Testing your guardrails with automated attack tools
Key insight: Guardrails are a necessary but insufficient defense. They must be combined with secure architecture (Ch 13), monitoring (Ch 14), and continuous red teaming (Ch 11). Treat them as seatbelts, not force fields.