Bypassing a model's safety alignment to elicit prohibited, harmful, or toxic outputs.
- Techniques: Attackers use role-play (e.g., "DAN"), hypothetical scenarios, or encoded payloads (Base64) to trick the model into ignoring its safety training.
- Many-Shot Jailbreaks: Overwhelming the model's context window with hundreds of fake "successful" malicious interactions to normalize bad behavior.