Guardrail Types
Guardrails are automated checks that validate LLM inputs and outputs before they reach the user.

Input guardrails:
- Detect and block prompt injection and jailbreak attempts
- Detect PII in user input
- Filter off-topic requests

Output guardrails:
- Detect hallucinations (check claims against source documents)
- Block toxic/harmful content
- Enforce format compliance (valid JSON, correct schema)
- Prevent data leakage (the model revealing its system prompt or training data)

Tools:
- Guardrails AI (open-source, Python validators)
- NeMo Guardrails (NVIDIA, dialog-level safety)
- Lakera Guard (managed, prompt injection detection)
- LLM-based validators (use a fast, cheap model to check the output of a more powerful model)
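The LLM-based validator pattern mentioned above can be sketched in a few lines. This is a minimal illustration, not any library's API: `call_judge` is a placeholder for whatever chat-completion client wraps the fast judge model, and the YES/NO prompt format is an assumption.

```python
from typing import Callable


def build_grounding_prompt(answer: str, sources: str) -> str:
    """Prompt asking a fast judge model whether the answer is grounded.

    The YES/NO format is an illustrative convention, not a standard.
    """
    return (
        "Answer YES or NO only. Is every factual claim in the ANSWER "
        "supported by the SOURCES?\n"
        f"SOURCES:\n{sources}\n\nANSWER:\n{answer}"
    )


def is_grounded(answer: str, sources: str,
                call_judge: Callable[[str], str]) -> bool:
    """Return True if the judge model says the answer is grounded.

    `call_judge` is a hypothetical hook: any function that sends a prompt
    to a fast model and returns its text reply.
    """
    verdict = call_judge(build_grounding_prompt(answer, sources))
    return verdict.strip().upper().startswith("YES")
```

Because the judge is just a callable, the same check works with any provider, and the guardrail can be unit-tested with a stubbed judge.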
Guardrails Example
# Guardrails AI example
import openai

from guardrails import Guard
from guardrails.hub import (
    ToxicLanguage,
    DetectPII,
    ValidJSON,
)

guard = Guard().use_many(
    ToxicLanguage(on_fail="fix"),
    DetectPII(
        pii_entities=["EMAIL", "PHONE", "SSN"],
        on_fail="fix",  # redact detected PII instead of rejecting
    ),
    ValidJSON(on_fail="reask"),  # re-prompt the model on invalid JSON
)

# `messages` is the chat history to send to the model
result = guard(
    llm_api=openai.chat.completions.create,
    model="gpt-4o",
    messages=messages,
)

# result.validated_output → safe, clean output
# result.validation_passed → True/False
Key insight: Guardrails add latency (50–200ms per check). Use fast, deterministic checks (regex, schema validation) for every request, and reserve expensive LLM-based checks (hallucination detection) for high-risk outputs.
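The cheap tier of that split can be stdlib-only. A minimal sketch of per-request deterministic checks; the regex patterns and violation labels are illustrative assumptions, not a vetted PII detector:

```python
import json
import re

# Illustrative patterns -- a real PII detector needs far broader coverage.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")


def cheap_output_checks(text: str) -> list[str]:
    """Run regex checks on every response; return labels of violations."""
    violations = []
    if SSN_RE.search(text):
        violations.append("pii:ssn")
    if EMAIL_RE.search(text):
        violations.append("pii:email")
    return violations


def valid_json(text: str) -> bool:
    """Deterministic format check: does the output parse as JSON?"""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False
```

These run in microseconds, so they gate every response; only outputs that pass (or that feed high-risk actions) need the slower LLM-based hallucination check.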