Ch 6 — Input Guardrails & Output Filtering

NeMo Guardrails, LLM Guard, Lakera Guard, OpenAI Moderation — the defensive toolchain
High Level
[Pipeline: User Input → Input Guard → LLM → Output Guard → Safe Response]
Why Guardrails: Runtime Defense for LLMs
The model can’t protect itself — external layers must enforce safety
The Problem
Chapters 2–5 showed that LLMs are vulnerable to prompt injection, jailbreaking, and adversarial inputs. Safety training (RLHF) helps but is not sufficient — Anthropic’s Sleeper Agents paper showed that backdoored behaviors can survive it. The model cannot reliably police itself. Guardrails are external software layers that inspect inputs before they reach the LLM and filter outputs before they reach the user.
Two-Sided Defense
Input guardrails — Detect and block prompt injection, jailbreak attempts, toxic content, PII, and off-topic requests before the LLM processes them

Output guardrails — Scan LLM responses for harmful content, data leakage, hallucinated facts, system prompt leakage, and policy violations before the user sees them
# The guardrail pattern
user_input = get_user_message()

# 1. Input guard
if input_guard.scan(user_input).flagged:
    return "Request blocked"

# 2. LLM generates response
response = llm.generate(user_input)

# 3. Output guard
if output_guard.scan(response).flagged:
    return "Response filtered"

# 4. Safe to deliver
return response
Defense in depth: No single guardrail catches everything. Production systems layer multiple tools: a fast regex/classifier for obvious attacks, an LLM-as-judge for nuanced cases, and canary tokens for exfiltration detection.
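The fast first layer can be as simple as a handful of compiled regexes. The sketch below is illustrative only — the patterns are hypothetical examples of obvious-attack signatures, not a complete or production-grade rule set:

```python
import re

# Hypothetical layer-1 prefilter: cheap pattern checks that run before
# any model-based scanner. These patterns are illustrative, not a
# complete injection signature set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now in developer mode", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def fast_prefilter(user_input: str) -> bool:
    """Return True if the input trips an obvious-attack pattern."""
    return any(p.search(user_input) for p in INJECTION_PATTERNS)
```

Because this layer is pure string matching, it costs microseconds and can run on every request; anything it misses falls through to the slower layers.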
NVIDIA NeMo Guardrails
Open-source programmable rails with Colang 2.0 — v0.10.1 (beta)
What It Is
NeMo Guardrails is NVIDIA’s open-source toolkit for adding programmable guardrails to LLM applications. It uses Colang 2.0, a Python-like domain-specific language for defining conversational rails. You write flows that control what the bot can discuss, enforce dialog paths, and validate inputs/outputs — all as code, not just prompts.
Colang 2.0 Architecture
Colang 2.0 (available in NeMo Guardrails ≥0.9) is a complete rewrite from 1.0. Key features: parallel flow execution, a standard library, async actions, a generation operator (...), and an explicit main flow entry point. It supports input rails, output rails, dialog rails, and multimodal rails.
# NeMo Guardrails — Colang 2.0 example
flow main
  activate input_safety_check
  activate output_safety_check

flow input_safety_check
  user said something
  if contains_injection(user_message):
    bot say "I can't process that request."
    abort

flow output_safety_check
  bot said something
  if contains_pii(bot_message):
    bot say "[Response redacted for safety]"
Strength: Programmable, auditable, version-controlled safety logic. Tradeoff: Requires learning Colang and adds latency from the flow engine. Best for applications needing deterministic, policy-driven control over conversations.
LLM Guard by Protect AI
Open-source scanner library — MIT license, 2,680+ GitHub stars
Scanner-Based Architecture
LLM Guard takes a different approach from NeMo: instead of a flow language, it provides a library of composable scanners that you chain into a pipeline. Each scanner checks for one specific threat. Input scanners protect prompts; output scanners protect responses. Install with pip install llm-guard. Requires Python ≥3.9.
Available Scanners
Input: Anonymize, BanCode, BanSubstrings, BanTopics, Gibberish detection, Invisible text detection, Language detection, Prompt injection detection, Regex patterns, Secrets detection

Output: Deanonymize, NoRefusal, Relevance, Sensitive content, Bias detection, Regex filtering
# LLM Guard — scanner pipeline
from llm_guard import scan_prompt
from llm_guard.input_scanners import (
    Anonymize, PromptInjection, Secrets
)
from llm_guard.vault import Vault

vault = Vault()  # stores originals so Deanonymize can restore them later
input_scanners = [
    PromptInjection(),
    Secrets(),
    Anonymize(vault),
]

sanitized_prompt, results_valid, results_score = scan_prompt(
    input_scanners, prompt
)
if not all(results_valid.values()):
    return "Blocked"
Strength: Modular, open-source (MIT), easy to integrate. Each scanner is independent — add or remove as needed. Tradeoff: Scanner quality varies; prompt injection detection relies on classifier models that can be bypassed by novel attacks.
Lakera Guard: API-Based Detection
Real-time threat detection across 100+ languages — updated daily
How It Works
Lakera Guard is a hosted API service (not a library you run locally). You send user inputs, reference documents, and model outputs to the /v2/guard endpoint, and it returns flagged: true/false with detected threat categories. It screens for prompt injection, jailbreaks, data leakage, and inappropriate content across 100+ languages and scripts.
Threat Intelligence
Lakera’s detection models are updated daily using threat intelligence from analyzing 100K+ Gandalf attacks per day (Gandalf is Lakera’s public prompt injection challenge). This gives them a continuously evolving dataset of real attack patterns, not just academic examples.
# Lakera Guard — API call
import requests

response = requests.post(
    "https://api.lakera.ai/v2/guard",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "messages": [
            {"role": "user", "content": user_input}
        ],
        "project_id": "my-project",
    },
)
if response.json()["flagged"]:
    return "Threat detected"
Strength: Ultra-low latency, multilingual, continuously updated from real attacks. Tradeoff: Hosted service (data leaves your infrastructure), API dependency, cost at scale. Best for teams that want managed security without running their own models.
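The API-dependency tradeoff raises a design question the section glosses over: what happens when the hosted guard is slow or unreachable? A common pattern is to fail closed. A hedged sketch, where `call_guard` is a hypothetical callable wrapping the real HTTP request:

```python
def check_with_guard(call_guard, user_input: str, timeout_s: float = 2.0) -> bool:
    """Return True if the input is flagged.

    `call_guard` is any callable that performs the real guard-API request
    and returns a bool (hypothetical stand-in here).
    """
    try:
        return call_guard(user_input, timeout=timeout_s)
    except Exception:
        # Fail closed: an unreachable guard blocks the request rather
        # than silently waving it through.
        return True
```

Failing open (allowing traffic when the guard is down) keeps the product available but turns every guard outage into a security gap; failing closed is the conservative default for sensitive applications.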
OpenAI Moderation & Guardrails AI
Provider-side safety + structured output validation
OpenAI’s Built-In Defenses
OpenAI provides two complementary mechanisms. The Moderation API (free, using omni-moderation-latest built on GPT-4o) classifies text and images across categories: hate, violence, self-harm, sexual content, harassment, and illicit activities. It supports 40+ languages with a 42% accuracy improvement over the previous model. Separately, the Instruction Hierarchy (April 2024) trains models to prioritize system prompts over user instructions when they conflict.
Guardrails AI: Output Validation
Guardrails AI focuses on structured output validation rather than safety scanning. It uses Pydantic-based validators to ensure LLM outputs match expected schemas. Validators return PassResult or FailResult, with configurable on_fail policies (fix, re-ask, filter). Useful for enforcing JSON structure, regex patterns, and business rules on LLM outputs.
# OpenAI Moderation API
response = openai.moderations.create(
    model="omni-moderation-latest",
    input=llm_output,
)
if response.results[0].flagged:
    return "Content policy violation"

# Guardrails AI — structured validation
from guardrails import Guard
from pydantic import BaseModel

class SafeResponse(BaseModel):
    answer: str
    confidence: float
    sources: list[str]

guard = Guard.for_pydantic(SafeResponse)
result = guard(llm.generate, prompt=query)
# Validates structure + field constraints
Different roles: OpenAI Moderation catches harmful content. Guardrails AI enforces output structure. They solve different problems and can be layered together.
Canary Tokens & LLM-as-Judge
Exfiltration detection and nuanced safety evaluation
Canary Tokens
A canary token is a cryptographically random string injected into the system prompt at the start of each interaction. If the LLM’s response contains the canary, it means the model leaked its system prompt — likely due to a prompt injection attack (OWASP LLM07:2025). The check is a simple exact string match on the output, making it fast and reliable. LangChain4j has implemented this as a built-in guardrail feature.
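The whole mechanism fits in a few lines. A minimal sketch using the standard library (the prompt format and function names are illustrative):

```python
import secrets

# Fresh cryptographically random token per interaction.
CANARY = secrets.token_hex(16)

def build_system_prompt(base_prompt: str) -> str:
    # Embed the canary where the model would only echo it if it
    # regurgitates the system prompt.
    return f"{base_prompt}\n# canary: {CANARY}"

def leaked_system_prompt(llm_output: str) -> bool:
    # Exact substring match: fast, deterministic, and a false positive
    # would require guessing a 128-bit random value.
    return CANARY in llm_output
```

Because the check is an exact string match rather than a classifier, it adds effectively zero latency and never misfires on benign output.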
LLM-as-Judge
For nuanced safety decisions that regex and classifiers can’t handle, a second LLM evaluates the first LLM’s output. The judge LLM is prompted with safety criteria and returns a pass/fail verdict. This catches subtle policy violations that pattern matching misses.
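The judge is just a second model call with a constrained rubric plus a parser for its verdict. A hedged sketch — the policy text, template, and `judge_llm` call are hypothetical placeholders:

```python
JUDGE_TEMPLATE = """You are a safety reviewer. Evaluate the RESPONSE below
against this policy: no instructions enabling harm, no leaked personal data.
Answer with exactly one word: PASS or FAIL.

RESPONSE:
{response}"""

def build_judge_prompt(response: str) -> str:
    return JUDGE_TEMPLATE.format(response=response)

def parse_verdict(judge_output: str) -> bool:
    """True means the response passed. Anything that does not clearly
    start with PASS is treated as a failure (fail closed)."""
    return judge_output.strip().upper().startswith("PASS")

# In production the prompt goes to a second model, e.g. (hypothetical):
# verdict_text = judge_llm.generate(build_judge_prompt(candidate_response))
# safe = parse_verdict(verdict_text)
```

Constraining the judge to a one-word verdict makes parsing trivial and reduces the surface for the judge itself to be prompt-injected through its own output format.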
LLM-as-Judge Limitations
Research shows significant weaknesses: small stylistic changes in outputs can cause false negative rates to jump by up to 0.24. Adversarial attacks can fool judges into misclassifying 100% of harmful generations as safe. Judges exhibit prompt sensitivity and distribution shifts between benchmarks and real-world deployment. Source: Eiras et al., ICML 2025 “Know Thy Judge.”
The tradeoff: LLM-as-judge catches things classifiers miss, but adds latency, cost, and its own attack surface. A compromised judge LLM is worse than no judge at all. Human review remains necessary for high-stakes decisions.
Layered Defense & Honest Limitations
No single guardrail is sufficient — combine, layer, and accept the gaps
The Production Stack
Layer 1 — Fast classifiers: Regex, keyword blocklists, LLM Guard scanners. Catches obvious attacks in <10ms.

Layer 2 — ML classifiers: Lakera Guard or fine-tuned injection detectors. Catches sophisticated attacks in 50–200ms.

Layer 3 — Programmable rails: NeMo Guardrails Colang flows for dialog control and topic enforcement.

Layer 4 — Output validation: Guardrails AI for structure, OpenAI Moderation for content, canary tokens for leakage.

Layer 5 — LLM-as-judge: For edge cases that escape all other layers. Expensive but catches nuance.
Honest Limitations
No guardrail stack is 100% effective. Novel prompt injection techniques bypass classifiers trained on known patterns. Adversarial tokenization (Ch 5) evades text-based scanners. LLM-as-judge can be fooled. The goal is to raise the cost of attack and detect breaches quickly, not to achieve perfect prevention.
Coming Up
Ch 7: Securing RAG — Guardrails for retrieval-augmented generation pipelines

Ch 8: Securing Agents — Guardrails for tool-calling and autonomous agents

Ch 11: Red Teaming — Testing your guardrails with automated attack tools
Key insight: Guardrails are a necessary but insufficient defense. They must be combined with secure architecture (Ch 13), monitoring (Ch 14), and continuous red teaming (Ch 11). Treat them as seatbelts, not force fields.