Ch 3 — Jailbreaking & Guardrail Bypass — Under the Hood

Crescendo, Skeleton Key, DAN, GCG — attack papers, success rates, and defense evasion
A. One-Shot Jailbreak Techniques: DAN, role-play, encoding, payload smuggling
  DAN Prompt: "Do Anything Now" persona
  Role-Play: fictional scenario override
  Encoding: Base64, ROT-13, leetspeak
  Payload Smuggle: hidden instructions in context
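
The encoding entry above can be made concrete with a small sketch. It shows how one benign placeholder string looks under Base64, ROT-13, and leetspeak, why a plain substring filter misses all three, and how decoding a few candidate forms before filtering restores the match. The leetspeak table and the function names (`encodings`, `candidate_decodings`, `naive_filter`, `normalized_filter`) are illustrative assumptions, not taken from any particular guardrail product.

```python
import base64
import codecs

# Illustrative leetspeak substitutions (an assumption; real variants differ).
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})
UNLEET = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t"})


def encodings(text: str) -> dict[str, str]:
    """Return the same text under three common obfuscations."""
    return {
        "base64": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        "rot13": codecs.encode(text, "rot_13"),
        "leet": text.lower().translate(LEET),
    }


def candidate_decodings(text: str) -> list[str]:
    """Return the input plus every decoding that can be applied to it."""
    candidates = [text, codecs.decode(text, "rot_13"), text.translate(UNLEET)]
    try:
        # validate=True rejects input containing non-Base64 characters.
        candidates.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        pass
    return candidates


def naive_filter(text: str, blocked: str) -> bool:
    """Substring match on the raw input only."""
    return blocked in text.lower()


def normalized_filter(text: str, blocked: str) -> bool:
    """Substring match on the raw input and on each candidate decoding."""
    return any(naive_filter(c, blocked) for c in candidate_decodings(text))


if __name__ == "__main__":
    sample = "this is a placeholder request"
    for name, encoded in encodings(sample).items():
        print(name, naive_filter(encoded, "placeholder"),
              normalized_filter(encoded, "placeholder"))
```

Decoding before filtering is only one possible mitigation. The reverse leetspeak table in particular will produce false positives on ordinary text that contains digits, so a real deployment would likely need something more careful than this sketch.
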
↓ Escalation: one-shot → multi-turn attacks

B. Multi-Turn Jailbreak Strategies: Crescendo, Skeleton Key (Microsoft Research, 2024)
  Benign Start: innocent opening question
  Crescendo: gradual escalation, <5 turns
  Skeleton Key: complete guardrail disable
  Bypass Achieved: safety training overridden
↓ Gradient-based: optimized adversarial suffixes

C. Gradient-Based & Optimization Attacks: GCG (Zou et al., 2023), universal adversarial suffixes
  GCG Attack: Greedy Coordinate Gradient
  Transferability: Vicuna → GPT-4, cross-model
  AutoDAN: automated suffix generation
↓ Benchmarking: measuring jailbreak effectiveness

D. Jailbreak Benchmarks & Success Rates: JailbreakBench, JailbreakRadar, JAILJUDGE
  JailbreakBench: 100 behaviors, public leaderboard
  JailbreakRadar: 17 attacks × 9 LLMs, 160 questions
  JAILJUDGE: 35K+ examples, ASR 40% → 0.15%
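
The benchmarks above report attack success rate (ASR): the fraction of attack attempts that a judge labels as jailbroken. The sketch below computes ASR overall and per group. The record layout, the function names, and the four toy records are assumptions made up for illustration; they are not results from JailbreakBench, JailbreakRadar, or JAILJUDGE.

```python
from collections import defaultdict


def attack_success_rate(verdicts) -> float:
    """Fraction of attempts the judge marked as jailbroken (True)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def asr_by_group(records, key: str) -> dict:
    """ASR computed separately for each value of records[i][key]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["jailbroken"])
    return {name: attack_success_rate(v) for name, v in groups.items()}


# Toy verdicts, invented for this example only.
records = [
    {"attack": "A", "model": "m1", "jailbroken": True},
    {"attack": "A", "model": "m2", "jailbroken": False},
    {"attack": "B", "model": "m1", "jailbroken": False},
    {"attack": "B", "model": "m2", "jailbroken": False},
]

# Size of a 17-attack x 9-model x 160-question grid, ASSUMING every
# attack is run against every model on every question.
FULL_GRID = 17 * 9 * 160

if __name__ == "__main__":
    print(attack_success_rate(r["jailbroken"] for r in records))
    print(asr_by_group(records, "attack"))
    print(asr_by_group(records, "model"))
    print(FULL_GRID)
```

Because ASR depends on the judge that assigns the verdicts, numbers from different benchmarks are comparable only when the judging procedure is comparable too.
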
↓ Defenses: detection, alignment, and layered mitigation

E. Defenses & Mitigations: detection, instruction hierarchy, perplexity filtering
  Perplexity Filter: detect gibberish GCG suffixes
  Instruction Hierarchy: OpenAI, Apr 2024; system > user
  Layered Defense: combine all mitigations
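
A perplexity filter and a layered check can be sketched in a few lines. Perplexity here is the exponential of the negative mean per-token log-probability, so text that a reference language model finds unlikely (such as an optimized gibberish suffix) scores high. The sketch assumes the per-token log-probabilities are already available from some reference model. The threshold of 50.0, the `looks_like_role_spoof` heuristic, and all function names are hypothetical choices for illustration, not values or APIs from the cited work.

```python
import math
from typing import Callable, Sequence

Check = Callable[[str, Sequence[float]], bool]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def high_perplexity(prompt: str, token_logprobs: Sequence[float],
                    threshold: float = 50.0) -> bool:
    """Flag prompts whose perplexity exceeds an (assumed) threshold."""
    return perplexity(token_logprobs) > threshold


def looks_like_role_spoof(prompt: str, token_logprobs: Sequence[float]) -> bool:
    """Hypothetical heuristic: user text that imitates a system message."""
    return prompt.lower().lstrip().startswith("system:")


def layered_check(prompt: str, token_logprobs: Sequence[float],
                  checks: Sequence[tuple[str, Check]]) -> list[str]:
    """Run every check and return the names of those that fired."""
    return [name for name, check in checks if check(prompt, token_logprobs)]


CHECKS: list[tuple[str, Check]] = [
    ("perplexity", high_perplexity),
    ("role_spoof", looks_like_role_spoof),
]

if __name__ == "__main__":
    # Toy log-probabilities; in practice these come from a reference model.
    print(layered_check("What is the capital of France?", [-1.0] * 5, CHECKS))
    print(layered_check("zx!! qq ]]", [-6.0] * 5, CHECKS))
```

Running all checks and collecting every reason, rather than stopping at the first hit, keeps each layer independent. A fluent attack that passes the perplexity check can still be caught by another layer, and the list of reasons can be logged for review.
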