Ch 8 — RLHF & Alignment

Teaching LLMs to be helpful, harmless, and honest — reward models, PPO, DPO, and Constitutional AI
Chapter roadmap: Why Alignment → Preferences → Reward Model → PPO → DPO → Constitutional AI → Safety → Frontier
Why Alignment Matters
SFT teaches format, but not quality or safety
The Analogy
SFT (Ch 7) taught the model to respond to questions. But it didn’t teach it to respond well. It’s like teaching someone to write emails — they know the format, but their emails might be rude, inaccurate, or overly verbose. Alignment is the finishing school: it teaches the model to be helpful (give good answers), harmless (refuse dangerous requests), and honest (acknowledge uncertainty). This is what makes ChatGPT feel polished.
Key insight: The “alignment problem” is fundamentally about making AI systems do what humans want, not just what they’re literally told. A model trained only on next-token prediction optimizes for “what text comes next?” — not “what response would a human prefer?” RLHF bridges this gap by directly optimizing for human preferences.
The Three H’s
# Anthropic's "HHH" framework:

# Helpful: provide useful, accurate answers
#   "What's the weather?" → actual forecast
#   Not: vague, unhelpful, or off-topic

# Harmless: refuse dangerous requests
#   "How to make a bomb?" → polite refusal
#   Not: detailed instructions

# Honest: acknowledge limitations
#   "I'm not sure about that" when uncertain
#   Not: confident hallucinations

# The training pipeline:
#   Pretraining → knowledge (Ch 6)
#   SFT → format (Ch 7)
#   RLHF/DPO → quality + safety (this chapter)

# Without alignment:
#   Model can be helpful but also harmful
#   Model can be confident but also wrong
#   Model can follow instructions... any instructions
Human Preference Data
Collecting “which response is better?” judgments
The Analogy
Imagine training a chef. Instead of giving them recipes (SFT), you have food critics taste two dishes and pick the better one. Over thousands of comparisons, the chef learns what “good food” means. RLHF works the same way: for each prompt, the model generates two responses, and a human annotator picks which one is better. These preference pairs become the training signal.
Key insight: It’s much easier for humans to compare two responses than to write a perfect response from scratch. “Is A or B better?” is a simpler task than “Write the ideal answer.” This is why preference-based training is so effective — it leverages the easiest form of human judgment. InstructGPT used ~33K preference comparisons from a team of ~40 annotators.
Preference Collection
# Preference data format:
{
  "prompt": "Explain black holes simply",
  "chosen": "A black hole is a region in space where gravity is so strong that nothing, not even light, can escape...",
  "rejected": "Black holes are fascinating astronomical phenomena that have captivated scientists for decades. The concept was first..."
}
# "chosen" is preferred: concise, direct
# "rejected" is worse: verbose, off-topic

# Annotation guidelines (typical):
# - Prefer helpful, accurate responses
# - Penalize hallucinations
# - Penalize harmful content
# - Prefer concise over verbose
# - Prefer structured over rambling

# Dataset sizes:
# InstructGPT: ~33K comparisons
# Llama 2: ~1M+ comparisons
# Anthropic HH: ~170K comparisons
The Reward Model: Learning What Humans Want
A neural network that scores response quality
The Analogy
You can’t have a human judge every response during training (too slow, too expensive). Instead, you train a reward model — a separate neural network that learns to predict which response a human would prefer. It’s like training an AI food critic: show it thousands of “A is better than B” judgments, and it learns to score any dish on its own. The reward model is typically a copy of the LLM with a scalar output head.
Key insight: The reward model is trained using the Bradley-Terry model from statistics: P(A > B) = σ(r(A) − r(B)), where r is the reward score and σ is the sigmoid function. This converts pairwise comparisons into a scalar reward. The loss is: L = −log(σ(r_chosen − r_rejected)). Sound familiar? It’s similar to cross-entropy, but over pairs instead of individual tokens.
Reward Model Training
# Reward model: LLM + scalar head
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, d_model):
        super().__init__()
        self.base = base_model            # e.g., Llama 8B
        self.head = nn.Linear(d_model, 1) # scalar output head

    def forward(self, input_ids):
        hidden = self.base(input_ids)          # (batch, seq, d_model)
        reward = self.head(hidden[:, -1, :])   # score at last token
        return reward.squeeze(-1)

# Bradley-Terry loss:
def reward_loss(r_chosen, r_rejected):
    return -torch.log(
        torch.sigmoid(r_chosen - r_rejected)
    ).mean()

# Training:
# For each (prompt, chosen, rejected):
#   r_w = reward_model(prompt + chosen)
#   r_l = reward_model(prompt + rejected)
#   loss = -log(σ(r_w - r_l))
# → Push r_chosen > r_rejected
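To sanity-check the numbers, here is the Bradley-Terry loss worked through in plain Python (standard library only — a hand-check of the formula, not the training code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_loss(r_chosen, r_rejected):
    # L = -log(sigma(r_chosen - r_rejected))
    return -math.log(sigmoid(r_chosen - r_rejected))

# Correct ranking with margin +1.5 → small loss;
# reversed ranking (margin -1.5) → large loss.
good = bradley_terry_loss(2.0, 0.5)
bad = bradley_terry_loss(0.5, 2.0)
print(round(good, 4))  # 0.2014
print(round(bad, 4))   # 1.7014
```

Note the asymmetry the sigmoid produces: the loss never reaches zero, so the model is always pushed to widen the margin between chosen and rejected.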
PPO: Optimizing with Reinforcement Learning
Using the reward model to improve the LLM
The Analogy
Now we have a critic (reward model). PPO (Proximal Policy Optimization) is the training process: the model generates a response, the reward model scores it, and the model adjusts to produce higher-scoring responses. It’s like a student writing essays, getting grades from an AI teacher, and improving based on the feedback. The “proximal” part means: don’t change too much at once — stay close to the SFT model to avoid catastrophic forgetting.
Key insight: RLHF with PPO requires running four models simultaneously: (1) the policy model being trained, (2) a reference model (frozen SFT copy, for KL penalty), (3) the reward model, and (4) a value model (critic). For a 70B model, that’s ~280B parameters in memory. This extreme resource requirement is why DPO became popular — it eliminates models 3 and 4.
The PPO Loop
# RLHF with PPO (simplified):
for prompt in prompts:
    # 1. Generate response
    response = policy_model.generate(prompt)

    # 2. Score with reward model
    reward = reward_model(prompt + response)

    # 3. KL penalty (don't drift too far)
    kl = kl_divergence(policy_model, ref_model)
    adjusted_reward = reward - β * kl
    # β ≈ 0.01-0.1 (controls drift)

    # 4. PPO update
    advantage = adjusted_reward - value_model(state)
    ratio = policy_new / policy_old
    clipped = torch.clamp(ratio, 1 - ε, 1 + ε)
    loss = -torch.min(ratio * advantage, clipped * advantage)
    loss.backward()
    optimizer.step()

# Models in memory simultaneously:
# 1. Policy (being trained): 8B
# 2. Reference (frozen): 8B
# 3. Reward model: 8B
# 4. Value model: 8B
# Total: ~32B params in memory!
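The clipping in step 4 is the heart of "proximal". A plain-Python sketch of the clipped objective for a single action — the numbers are illustrative, not real model outputs:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s)
    clipped = max(1 - eps, min(1 + eps, ratio))
    # PPO takes the minimum: a pessimistic bound, so the
    # policy gains nothing by moving further than the clip range
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once ratio > 1 + eps
print(ppo_clipped_objective(1.5, 2.0))   # 2.4, not 3.0
# Negative advantage: the penalty is NOT capped
print(ppo_clipped_objective(1.5, -2.0))  # -3.0
```

This asymmetry is deliberate: large steps that would increase the objective are clipped away, while large steps in the wrong direction are penalized in full.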
DPO: Direct Preference Optimization
Skip the reward model entirely — learn directly from preferences
The Analogy
PPO is like hiring a food critic (reward model), then using their scores to train the chef. DPO asks: why not train the chef directly from the taste tests? Rafailov et al. (2023) proved mathematically that you can derive the optimal policy directly from preference data, without ever training a separate reward model. It’s simpler, more stable, and requires half the GPU memory.
Key insight: DPO’s key equation reparameterizes the reward as: r(x,y) = β · log(π(y|x) / π_ref(y|x)) + const. This means the reward is implicit in the ratio of the trained model’s probability to the reference model’s probability. No separate reward model needed. DPO is now the most popular alignment method for open-source models (Llama 3, Mistral, Zephyr).
DPO Implementation
# DPO loss (the entire algorithm):
import torch.nn.functional as F

def dpo_loss(policy, ref, chosen, rejected, β):
    # Log probs from policy model
    π_w = policy.log_prob(chosen)
    π_l = policy.log_prob(rejected)

    # Log probs from reference (frozen)
    ref_w = ref.log_prob(chosen)
    ref_l = ref.log_prob(rejected)

    # DPO objective
    logits = β * ((π_w - ref_w) - (π_l - ref_l))
    return -F.logsigmoid(logits).mean()

# That's it! No reward model, no PPO,
# no value function, no clipping.

# Comparison:
# PPO: 4 models, complex, unstable
# DPO: 2 models, simple, stable

# Models needed:
# 1. Policy (being trained): 8B
# 2. Reference (frozen): 8B
# Total: ~16B params (half of PPO!)

# Used by: Llama 3, Zephyr, Mistral,
# Intel Neural Chat, many open models
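The same loss can be checked numerically in plain Python. The log-probabilities below are made-up illustrative values, not outputs of any real model:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # All inputs are log-probabilities of full responses.
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(logits)) == softplus(-logits)
    return math.log(1.0 + math.exp(-logits))

# Policy prefers the chosen response more than the reference does
# → positive implicit reward margin → loss below log 2
print(dpo_loss(pi_w=-10.0, pi_l=-14.0, ref_w=-12.0, ref_l=-12.0))

# No preference shift at all gives exactly log 2
print(round(dpo_loss(-12.0, -12.0, -12.0, -12.0), 4))  # 0.6931
```

The log-2 baseline is a useful debugging landmark in practice: a DPO loss stuck near 0.693 means the policy has learned no preference signal relative to the reference.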
Constitutional AI: Rules Instead of Humans
Anthropic’s approach — self-critique guided by principles
The Analogy
Instead of hiring thousands of human judges, what if you gave the model a rulebook (constitution) and asked it to judge itself? Constitutional AI (Bai et al., 2022, Anthropic) does exactly this. Phase 1: the model generates a response, then critiques and revises it against principles like “Choose the response that is least harmful.” Phase 2: use AI-generated preferences (RLAIF) instead of human preferences to train the reward model.
Key insight: CAI’s constitution includes principles from the UN Declaration of Human Rights, Apple’s Terms of Service, research on AI safety, and Anthropic’s internal guidelines. The model essentially becomes its own alignment teacher. This scales much better than human annotation and can be updated by changing the rules rather than collecting new data. Claude is built with Constitutional AI.
The CAI Process
# Constitutional AI (two phases):

# Phase 1: Supervised Self-Critique
# 1. Model generates response to harmful prompt
# 2. Model critiques its own response:
#    "Does this response violate the principle:
#     'Choose the response that is most
#      supportive and encouraging'?"
# 3. Model revises based on critique
# 4. Train on revised responses (SFT)

# Phase 2: RLAIF (RL from AI Feedback)
# 1. Generate pairs of responses
# 2. AI judges which is better (using rules)
# 3. Train reward model on AI preferences
# 4. Run PPO/DPO with AI-trained reward model

# Example constitutional principles:
# - "Choose the response that is least harmful"
# - "Choose the response that is most honest"
# - "Choose the response that best acknowledges
#    its own limitations"
# - "Choose the response that is least likely
#    to be used for illegal activities"
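Phase 1's critique-and-revise loop can be sketched as follows. Here `model` is a stand-in callable for an LLM call, and the wiring (prompt templates, one revision pass per principle) is a simplifying assumption, not Anthropic's exact recipe:

```python
CONSTITUTION = [
    "Choose the response that is least harmful",
    "Choose the response that is most honest",
]

def self_critique_pipeline(model, prompt):
    # 1. Initial response
    response = model(prompt)
    for principle in CONSTITUTION:
        # 2. Critique against one principle
        critique = model(
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        # 3. Revise based on the critique
        response = model(
            f"Revise the response to address this critique:\n"
            f"{critique}\n\nOriginal response:\n{response}"
        )
    # 4. The final revision becomes an SFT training target
    return response
```

The point of the structure is that every step is just another LLM call: no human touches the loop, only the constitution's wording.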
Safety Training: Red-Teaming and Guardrails
Testing and hardening models against misuse
The Analogy
Before a bank opens, it hires people to try to break in (penetration testing). Red-teaming does the same for LLMs: teams of experts try to make the model produce harmful content, leak private data, or bypass safety filters. Every successful attack becomes a training example. Llama 2’s safety paper describes extensive red-teaming across categories: violence, illegal activities, hate speech, and more.
Key insight: Safety is an arms race. Jailbreaks (prompt injection attacks that bypass safety training) are constantly discovered and patched. Common techniques: role-playing (“Pretend you’re an evil AI”), encoding (Base64, ROT13), multi-turn escalation, and many-shot prompting. Models are never perfectly safe — alignment is about raising the bar, not achieving perfection. This connects to AI Security (a whole field of study).
Safety Techniques
# Safety training pipeline:

# 1. Red-teaming (pre-deployment)
#    - Internal teams try to break the model
#    - External bounty programs
#    - Automated adversarial testing
#    - Categories: violence, CSAM, weapons,
#      illegal activity, PII, bias

# 2. Safety SFT
#    - Train on (harmful_prompt, refusal) pairs
#    - "How to hack a server?" → polite refusal
#    - Balance: helpful but not harmful

# 3. Safety RLHF
#    - Separate safety reward model
#    - Penalizes harmful outputs heavily
#    - Llama 2: safety RM + helpfulness RM

# 4. System-level guardrails
#    - Input classifiers (detect harmful prompts)
#    - Output classifiers (detect harmful outputs)
#    - Llama Guard: safety classifier model
#    - Content filters, rate limiting
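As a toy illustration of the input-classifier idea in step 4, here is a keyword filter in plain Python. Real guardrails (e.g., Llama Guard) are trained classifier models, not keyword lists; the categories and patterns below are invented for illustration only:

```python
# Toy stand-in for an input classifier. Real systems use a
# trained safety model; keyword lists are trivially bypassed.
BLOCKED_PATTERNS = {
    "weapons": ["make a bomb", "build a weapon"],
    "hacking": ["hack a server", "steal passwords"],
}

def classify_input(prompt):
    """Return (is_safe, category) for an incoming prompt."""
    lowered = prompt.lower()
    for category, patterns in BLOCKED_PATTERNS.items():
        if any(p in lowered for p in patterns):
            return False, category
    return True, None

print(classify_input("How do I hack a server?"))    # (False, 'hacking')
print(classify_input("Explain black holes simply")) # (True, None)
```

Even this toy makes the architectural point: the guardrail sits outside the model, so it can be updated or swapped without retraining the LLM itself.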
The Alignment Frontier
Where alignment research is heading
Current State
The field has evolved rapidly: RLHF (2022) proved alignment works. DPO (2023) made it simpler. Constitutional AI reduced human annotation needs. RLAIF uses AI feedback at scale. Newer methods like ORPO (Odds Ratio Preference Optimization) and KTO (Kahneman-Tversky Optimization) simplify further. The trend: simpler algorithms, less human labor, better results.
Key insight: A 2024 ICML study found that when properly tuned, PPO still outperforms DPO on many benchmarks. But DPO’s simplicity makes it the practical choice for most teams. The real frontier is scalable oversight: how do you align models that are smarter than the humans evaluating them? This is an open research problem that becomes more urgent as models improve.
Methods Comparison
# Alignment methods evolution:

# RLHF + PPO (2022, InstructGPT)
# ✓ Proven at scale (ChatGPT, GPT-4)
# ✗ Complex (4 models), unstable
# ✗ Expensive (reward model + RL)

# DPO (2023, Rafailov et al.)
# ✓ Simple (2 models), stable
# ✓ No reward model needed
# ✗ May underperform PPO on some tasks

# Constitutional AI (2022, Anthropic)
# ✓ Scalable (AI feedback, not human)
# ✓ Updatable (change rules, not data)
# ✗ Requires strong base model

# ORPO (2024): no reference model needed
# KTO (2024): works with binary feedback
# GRPO (2024, DeepSeek): group-relative PPO

# The trend: simpler, cheaper, better
# From 4 models → 2 models → 1 model