Ch 3 — Jailbreaking & Guardrail Bypass — Under the Hood
Crescendo, Skeleton Key, DAN, GCG — attack papers, success rates, and defense evasion
A. One-Shot Jailbreak Techniques: DAN, role-play, encoding, payload smuggling
DAN Prompt: "Do Anything Now" persona
Role-Play: fictional scenario override
Encoding: Base64, ROT-13, leetspeak
Payload Smuggle: hidden instructions in context
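Encoded inputs matter on the defensive side because a filter that only reads the raw text never sees the decoded request. The sketch below shows one possible way to produce decoded candidates for a downstream safety check to inspect. The names `LEET_MAP`, `try_base64`, and `candidate_decodings` are hypothetical, the leetspeak table is deliberately tiny, and the length heuristic for Base64 is an assumption rather than a rule from any of the cited work.

```python
import base64
import binascii
import codecs
from typing import List, Optional

# Hypothetical, deliberately small leetspeak table; real inputs use more substitutions.
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t"})


def try_base64(text: str) -> Optional[str]:
    """Return the decoded string if `text` is strict Base64 for printable UTF-8, else None."""
    stripped = text.strip()
    # Assumed heuristic: skip very short strings and lengths that cannot be padded Base64.
    if len(stripped) < 8 or len(stripped) % 4 != 0:
        return None
    try:
        decoded = base64.b64decode(stripped, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return decoded if decoded.isprintable() else None


def candidate_decodings(text: str) -> List[str]:
    """Return the raw text plus plausible decodings for a safety check to inspect."""
    candidates = [
        text,
        codecs.decode(text, "rot_13"),       # ROT-13 is its own inverse
        text.lower().translate(LEET_MAP),    # undo common digit-for-letter swaps
    ]
    b64 = try_base64(text)
    if b64 is not None:
        candidates.append(b64)
    return candidates
```

A caller would run its existing content check over every string returned by `candidate_decodings` instead of over the raw input alone. This does not cover nested or mixed encodings.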
Escalation: one-shot → multi-turn attacks

B. Multi-Turn Jailbreak Strategies: Crescendo, Skeleton Key (Microsoft Research, 2024)
Benign Start: innocent opening question
Crescendo: gradual escalation in <5 turns
Skeleton Key: complete guardrail disable
Bypass Achieved: safety training overridden
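Gradual escalation is hard to catch with a check that looks at one turn at a time, since each individual message can stay under the limit. The sketch below illustrates one possible conversation-level check that also sums risk over a sliding window of recent turns. It assumes a per-turn risk score in [0, 1] is already available from some classifier; the function name `escalation_flags`, the scores, and both thresholds are hypothetical values chosen for illustration.

```python
from typing import List, Sequence


def escalation_flags(
    turn_scores: Sequence[float],
    per_turn_limit: float = 0.8,   # assumed limit for a single turn
    window_limit: float = 1.5,     # assumed limit for the summed recent turns
    window: int = 3,               # number of recent turns to sum
) -> List[bool]:
    """Flag each turn whose own score, or whose recent-window sum, exceeds its limit."""
    flags = []
    for i, score in enumerate(turn_scores):
        recent = turn_scores[max(0, i - window + 1): i + 1]
        flags.append(score > per_turn_limit or sum(recent) > window_limit)
    return flags


# Every turn stays below the per-turn limit, but the last window does not.
scores = [0.1, 0.3, 0.5, 0.6, 0.7]
flags = escalation_flags(scores)
```

In this toy run only the final turn is flagged, and only because of the window sum. A per-turn check with the same limit would have passed all five turns.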
Gradient-based: optimized adversarial suffixes

C. Gradient-Based & Optimization Attacks: GCG (Zou et al., 2023), universal adversarial suffixes
GCG Attack: Greedy Coordinate Gradient
Transferability: Vicuna → GPT-4, cross-model
AutoDAN: automated suffix generation
Benchmarking: measuring jailbreak effectiveness

D. Jailbreak Benchmarks & Success Rates: JailbreakBench, JailbreakRadar, JAILJUDGE
JailbreakBench: 100 behaviors, public leaderboard
JailbreakRadar: 17 attacks × 9 LLMs, 160 questions
JAILJUDGE: 35K+ examples, attack success rate (ASR) 40% → 0.15%
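The success rates above are attack success rates: the fraction of tested behaviors for which a judge decides the model was jailbroken. A minimal sketch of that arithmetic follows. The `Verdict` record and its field names are hypothetical, and the example counts are made up to match a 40% rate; they are not taken from any benchmark's data.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Verdict:
    behavior_id: str   # hypothetical identifier for the tested behavior
    jailbroken: bool   # the judge's decision for this attempt


def attack_success_rate(verdicts: Iterable[Verdict]) -> float:
    """Return successful attempts divided by total attempts."""
    items = list(verdicts)
    if not items:
        raise ValueError("need at least one verdict")
    return sum(v.jailbroken for v in items) / len(items)


# Illustrative data: 100 behaviors, 40 of them judged as jailbroken.
example = [Verdict(f"b{i}", i < 40) for i in range(100)]
asr = attack_success_rate(example)
```

Because the rate depends entirely on the judge's decisions, two benchmarks with different judges can report different ASR values for the same attack and model.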
Defenses: detection, alignment, and layered mitigation

E. Defenses & Mitigations: detection, instruction hierarchy, perplexity filtering
Perplexity Filter: detect gibberish GCG suffixes
Instruction Hierarchy: OpenAI, Apr 2024; system > user
Layered Defense: combine all mitigations
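A perplexity filter relies on the fact that optimized suffixes tend to look like gibberish to a language model, so their per-token log-probabilities are very low. The sketch below computes perplexity as the exponential of the mean negative log-likelihood and runs it as one layer in a chain of checks. Everything here is an illustrative assumption: `toy_scorer` is a stand-in for a real reference model's token log-probabilities, and the threshold of 50 is arbitrary.

```python
import math
from typing import Callable, List, Optional, Sequence

# A check returns a reason string when it blocks, or None when it passes.
Check = Callable[[str], Optional[str]]


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def make_perplexity_check(
    score_tokens: Callable[[str], List[float]], threshold: float
) -> Check:
    """Build a check that blocks prompts whose perplexity exceeds `threshold`."""
    def check(prompt: str) -> Optional[str]:
        ppl = perplexity(score_tokens(prompt))
        return f"perplexity {ppl:.1f} above threshold" if ppl > threshold else None
    return check


def run_layers(prompt: str, checks: Sequence[Check]) -> Optional[str]:
    """Run checks in order; return the first blocking reason, or None if all pass."""
    for check in checks:
        reason = check(prompt)
        if reason is not None:
            return reason
    return None


def toy_scorer(prompt: str) -> List[float]:
    """Hypothetical stand-in for a reference LM: alphabetic words are 'likely'."""
    return [-1.0 if word.isalpha() else -8.0 for word in prompt.split()]


ppl_check = make_perplexity_check(toy_scorer, threshold=50.0)
plain_result = run_layers("tell me a story", [ppl_check])
noisy_result = run_layers("tell }{ ;)] !!", [ppl_check])
```

With the toy scorer, the plain sentence passes and the symbol-heavy one is blocked. Additional layers would be appended to the list passed to `run_layers`. A perplexity check on its own does not address fluent attacks such as role-play or multi-turn escalation, which is the reason for combining mitigations.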