Ch 6 — Alignment: RLHF & Reward Models

Why alignment matters, the RLHF pipeline, reward model training, PPO, and the InstructGPT recipe
Why Alignment Matters
The gap between a capable model and a helpful, harmless, honest one
The Problem
A pre-trained LLM is a powerful text predictor, but it has no concept of being helpful, harmless, or honest (the "HHH" criteria from Anthropic). It will happily:

- Generate toxic or harmful content
- Make up facts with high confidence
- Follow dangerous instructions
- Produce verbose, unhelpful responses
- Refuse to follow reasonable instructions

SFT (supervised fine-tuning) teaches the model the format of helpful responses, but it doesn't teach the model to prefer good responses over bad ones. The model learns to mimic the training data, not to optimize for quality.
What Alignment Does
Alignment is the process of making a model's behavior match human intentions and values. It goes beyond SFT by teaching the model to:

- Prefer helpful responses over unhelpful ones
- Refuse harmful requests appropriately
- Acknowledge uncertainty instead of hallucinating
- Follow instructions precisely
- Be concise when brevity is appropriate

The key insight: alignment requires preference data (which response is better?) rather than just demonstration data (what is a good response?).
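The difference in data shape is concrete. A sketch of the two record formats (the field names here are illustrative, not the schema of any particular dataset):

```python
# Illustrative record formats -- field names are assumptions, not a real schema.

# SFT / demonstration data: "what is a good response?"
demonstration = {
    "prompt": "Summarize this article in two sentences.",
    "response": "The article argues that ...",
}

# Alignment / preference data: "which response is better?"
preference = {
    "prompt": "Summarize this article in two sentences.",
    "chosen": "The article argues that ...",          # preferred by the annotator
    "rejected": "This article is about an article ...",  # dispreferred
}
```

The same prompt appears in both, but only the preference record carries a comparison signal the model can learn a ranking from.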
The alignment tax: Alignment typically makes models slightly worse at raw benchmarks (e.g., MMLU) but dramatically better at real-world usefulness. This is a worthwhile trade-off. A model that scores 85% on MMLU but follows instructions well is more useful than one that scores 90% but ignores instructions.
The RLHF Pipeline
Three stages from base model to aligned model
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained base model. Fine-tune it on high-quality instruction-response pairs to teach the format and style of helpful responses.

Data: 10K-100K curated (prompt, response) pairs
Result: A model that can follow instructions but doesn't yet distinguish good from great responses
Stage 2: Reward Model Training
Train a separate model to predict human preferences. Given a prompt and two responses, the reward model outputs which response is better.

Data: 50K-500K (prompt, chosen, rejected) triples
Result: A reward model that scores any response on a quality scale
Stage 3: RL Optimization (PPO)
Use the reward model as a scoring function. The SFT model generates responses, the reward model scores them, and PPO (Proximal Policy Optimization) updates the model to generate higher-scoring responses.

Key constraint: A KL-divergence penalty prevents the model from drifting too far from the SFT model (to avoid reward hacking).

Result: An aligned model that generates responses humans prefer
The InstructGPT recipe (Ouyang et al., 2022):

Base model (GPT-3)
    ↓ SFT on 13K demonstrations
SFT model
    ↓ Train reward model on 33K comparisons
Reward model
    ↓ PPO with 31K prompts
InstructGPT (preferred over 175B GPT-3)
InstructGPT (Ouyang et al., 2022, OpenAI) was the landmark paper that proved RLHF works at scale. A 1.3B InstructGPT model was preferred by humans over the 175B GPT-3 base model. This showed that alignment is more important than raw scale for user-facing applications.
Preference Data Collection
How human preferences are gathered and structured
The Comparison Format
Human annotators are shown a prompt and two (or more) model-generated responses. They choose which response is better, or rank them. This creates preference pairs:

Format: (prompt, chosen_response, rejected_response)

Example:
Prompt: "Explain quantum entanglement simply"
Chosen: "Quantum entanglement is when two particles become linked so that measuring one instantly affects the other, no matter the distance..."
Rejected: "Quantum entanglement is a phenomenon described by the mathematical formalism of quantum mechanics involving the tensor product of Hilbert spaces..."

The chosen response is simpler and more helpful for the given prompt.
Data Collection Methods
Human annotation (gold standard):
- Hire trained annotators (OpenAI used ~40 contractors)
- Provide detailed guidelines on what "better" means
- Inter-annotator agreement is typically 65-75%
- Cost: $1-$5 per comparison

AI feedback (scalable alternative):
- Use a strong model (GPT-4, Claude) as the judge
- "Constitutional AI" (Bai et al., 2022, Anthropic): the model critiques and revises its own outputs based on a set of principles
- RLAIF: RL from AI Feedback
- Much cheaper but may inherit biases from the judge model
Public Preference Datasets
Anthropic HH-RLHF: ~170K human preference pairs for helpfulness and harmlessness
OpenAssistant OASST1: ~88K messages in 35 languages with rankings
UltraFeedback: ~64K prompts with GPT-4 preference annotations
Nectar: ~183K comparisons from 7 models, ranked by GPT-4
Data quality matters enormously. Noisy preference labels (low inter-annotator agreement) lead to a weak reward model, which leads to poor alignment. Investing in clear annotation guidelines and annotator training pays off more than collecting more data.
Reward Model Training
Teaching a model to score response quality
Architecture
A reward model is typically the same architecture as the LLM but with the language modeling head replaced by a scalar output head (a single linear layer that outputs one number).

Common approach: Start from the SFT model checkpoint, replace the LM head with a value head. This gives the reward model a strong understanding of language before it learns to score quality.

Input: (prompt + response) concatenated
Output: A single scalar reward score
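At the shape level, the change is small: keep the transformer body, swap the (hidden, vocab) LM head for a (hidden, 1) value head, and read off a scalar at the final token. A minimal numpy sketch (dimensions are toy values for illustration):

```python
import numpy as np

hidden_dim, vocab_size = 16, 1000  # toy sizes, for illustration only

lm_head = np.zeros((hidden_dim, vocab_size))  # original head: next-token logits
value_head = np.zeros((hidden_dim, 1))        # replacement: a single scalar score

def reward_score(last_hidden):
    # last_hidden: transformer output at the final token of (prompt + response)
    return float(last_hidden @ value_head)
```

Everything before the head is initialized from the SFT checkpoint; only the value head starts fresh.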
Training Objective
The reward model is trained with the Bradley-Terry preference model:

Given a prompt x, chosen response y_w, and rejected response y_l:

Loss = -log(sigmoid(r(x, y_w) - r(x, y_l)))

This pushes the reward model to assign a higher score to the chosen response than the rejected one. The sigmoid ensures the loss is bounded and well-behaved.
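The loss above can be written directly. A minimal numpy sketch, operating on the two scalar scores:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    # -log sigmoid(r(x, y_w) - r(x, y_l)): near zero when the chosen
    # response already outscores the rejected one by a wide margin
    margin = r_chosen - r_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# A larger margin in the right direction means a smaller loss
assert preference_loss(2.0, 0.0) < preference_loss(0.0, 0.0) < preference_loss(0.0, 2.0)
```

Note the loss depends only on the score difference, not the absolute scores: reward model outputs are meaningful only relative to each other.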
Reward Model Quality
Accuracy metric: On held-out preference pairs, how often does the reward model agree with the human label? Typical accuracy: 65-75% (matching human inter-annotator agreement).

Reward model size: Usually the same size as the policy model or smaller. OpenAI used a 6B reward model for InstructGPT. Anthropic uses reward models of similar size to the policy.

Overoptimization risk: If the policy model optimizes too hard against the reward model, it finds adversarial inputs that score high but are actually bad (reward hacking). This is why KL penalty is essential.
The reward model is the bottleneck of RLHF. A bad reward model means the RL step optimizes for the wrong thing. Common failure modes: (1) reward model is too small and can't capture nuanced preferences, (2) training data has inconsistent labels, (3) reward model overfits to surface features (length, formatting) rather than actual quality.
PPO: Proximal Policy Optimization
The RL algorithm that updates the language model
How PPO Works for LLMs
Four models in memory simultaneously:

1. Policy model (active): The LLM being optimized. Generates responses.

2. Reference model (frozen): A copy of the SFT model. Used to compute KL divergence (how far has the policy drifted?).

3. Reward model (frozen): Scores the generated responses.

4. Value model (active): Estimates expected future reward. Used to compute advantages for PPO.

This means PPO requires 4x the memory of SFT. For a 7B model: ~360 GB of GPU memory. This is why RLHF is expensive.
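A back-of-envelope check of that figure (the precisions and optimizer layout below are assumptions; real totals depend on sharding, optimizer choice, and activation checkpointing):

```python
# Rough GPU-memory accounting for PPO on a 7B model.
# Assumed layout: bf16 weights + bf16 grads + fp32 master weights + fp32 Adam
# moments for the two trainable models; bf16 weights only for the frozen ones.
P = 7e9  # parameters per model

trainable_bytes = P * (2 + 2 + 4 + 4 + 4)  # weights, grads, master copy, Adam m, Adam v
frozen_bytes = P * 2                       # weights only

# policy + value are trainable; reference + reward are frozen
total_gb = (2 * trainable_bytes + 2 * frozen_bytes) / 1e9
print(round(total_gb))  # 252 GB for parameters and optimizer state alone
```

Activation memory and the KV cache used while generating responses add on top of this parameter budget, which is how the total climbs toward the ~360 GB figure.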
The PPO Training Loop
For each batch of prompts:

1. Generate: Policy model generates responses
2. Score: Reward model scores each response
3. Compute advantage: (reward - KL_penalty) - value_estimate
4. Update policy: PPO clipped objective (prevent large updates)
5. Update value model: Fit to observed rewards

KL penalty: reward_final = reward - beta * KL(policy || reference)
Typical beta: 0.01-0.2. Higher beta = more conservative (stays closer to SFT model).
# PPO objective (simplified)
reward = reward_model(prompt, response)
kl = log_prob_policy - log_prob_reference
penalized_reward = reward - beta * kl

# PPO clipped objective
ratio = exp(log_prob_new - log_prob_old)
clipped = clip(ratio, 1-eps, 1+eps)
loss = -min(ratio * advantage, clipped * advantage)
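The clipped objective can be made concrete as a runnable scalar sketch (the log-probs and eps value here are placeholders, not tuned settings):

```python
import numpy as np

def ppo_clipped_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    ratio = np.exp(log_prob_new - log_prob_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic min: a step earns no extra credit for pushing the
    # probability ratio outside [1 - eps, 1 + eps]
    return -np.minimum(ratio * advantage, clipped * advantage)

# With positive advantage, a ratio of 2.0 is clipped back to 1.2:
assert np.isclose(ppo_clipped_loss(np.log(2.0), 0.0, 1.0), -1.2)
```

The clip is what makes PPO "proximal": even a response with a huge advantage can only move the policy a bounded amount per update.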
PPO is notoriously unstable for LLMs. Common issues: reward hacking (model finds exploits in the reward model), mode collapse (model generates only one type of response), training instability (loss spikes). This complexity is a major motivation for simpler alternatives like DPO (Chapter 7).
RLHF Challenges
Why RLHF is hard and what can go wrong
Reward Hacking
The policy model finds ways to get high reward scores without actually being helpful. Examples:

- Length gaming: Longer responses often get higher rewards, so the model becomes excessively verbose
- Sycophancy: The model learns to agree with the user regardless of correctness, because agreeable responses were preferred in training
- Format gaming: The model produces responses with bullet points, headers, and formatting that the reward model likes, even when plain text would be better
- Hedging: The model adds excessive caveats and disclaimers to avoid being wrong
Computational Cost
4 models in memory: Policy + reference + reward + value model. For a 7B model, this requires 8-16 A100 GPUs.

Generation during training: Each PPO step requires generating full responses, which is much slower than forward/backward passes in SFT.

Typical RLHF cost: 5-20x more expensive than SFT for the same model.
Training Instability
Hyperparameter sensitivity: PPO has many hyperparameters (KL coefficient, clip range, learning rate, batch size, number of PPO epochs per batch). Small changes can cause training to diverge.

Reward model quality ceiling: The aligned model can only be as good as the reward model. If the reward model has blind spots, the policy will exploit them.

Evaluation difficulty: There's no simple metric for "alignment quality." Human evaluation is expensive and subjective. Automated benchmarks (MT-Bench, AlpacaEval) are imperfect proxies.
These challenges drove the development of DPO (Direct Preference Optimization, Rafailov et al., 2023). DPO eliminates the reward model and PPO entirely, directly optimizing the policy on preference data. It's simpler, more stable, and cheaper. We cover DPO in detail in Chapter 7.
Alignment History & Timeline
Key papers and milestones in RLHF
Key Papers
2017 — "Learning from Human Preferences" (Christiano et al., OpenAI): First demonstration of RLHF for deep RL agents. Showed humans could train agents by comparing trajectory segments.

2020 — "Learning to summarize from human feedback" (Stiennon et al., OpenAI): Applied RLHF to text summarization. Trained a reward model on human comparisons, then used PPO to optimize a summarization model.

2022 — "Training language models to follow instructions with human feedback" (InstructGPT) (Ouyang et al., OpenAI): The paper that launched ChatGPT. Showed RLHF works at scale for general instruction following.

2022 — "Constitutional AI" (CAI) (Bai et al., Anthropic): Replaced human feedback with AI feedback based on a set of principles ("constitution"). The foundation of Claude.
The Shift Away from PPO
2023 — "Direct Preference Optimization" (DPO) (Rafailov et al., Stanford): Showed that RLHF can be reformulated as a simple classification loss, eliminating the reward model and PPO. Became the dominant alignment method.

2024 — ORPO, SimPO, KTO: Further simplifications. ORPO combines SFT and alignment in one step. SimPO removes the reference model. KTO works with binary feedback (good/bad) instead of comparisons.

Current state (2025): Most open-source models use DPO or its variants. PPO-based RLHF is still used by frontier labs (OpenAI, Anthropic, Google) where the extra complexity is justified by the scale and stakes.
The trend is toward simplification. RLHF with PPO requires 4 models and is hard to tune. DPO requires 2 models and is a simple loss function. ORPO requires 1 model and combines SFT + alignment. Each generation is simpler, cheaper, and often just as effective. But PPO remains the gold standard for maximum control at frontier scale.