How PPO Works for LLMs
Four models in memory simultaneously:
1. Policy model (active): The LLM being optimized. Generates responses.
2. Reference model (frozen): A copy of the SFT model. Used to compute KL divergence (how far has the policy drifted?).
3. Reward model (frozen): Scores the generated responses.
4. Value model (active): Estimates expected future reward. Used to compute advantages for PPO.
This means PPO requires roughly 4x the memory of SFT, since four models are resident at once. For a 7B model this can mean ~360 GB of GPU memory. This is why RLHF is expensive.
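A back-of-envelope sketch of where that memory goes. The byte counts here are assumptions, not figures from the text: fp16 weights for the frozen models, and weights plus gradients plus fp32 Adam states for the trainable ones.

```python
# Rough GPU-memory estimate for PPO on a 7B model.
# Assumed costs: frozen models hold fp16 weights only (~2 bytes/param);
# trainable models also carry gradients and fp32 Adam states
# (~16 bytes/param total under typical mixed-precision training).
PARAMS = 7e9
FROZEN_BYTES_PER_PARAM = 2      # reference model, reward model
TRAINABLE_BYTES_PER_PARAM = 16  # policy model, value model

frozen = 2 * PARAMS * FROZEN_BYTES_PER_PARAM / 1e9      # GB
trainable = 2 * PARAMS * TRAINABLE_BYTES_PER_PARAM / 1e9  # GB
print(f"frozen: {frozen:.0f} GB, trainable: {trainable:.0f} GB, "
      f"total before activations: {frozen + trainable:.0f} GB")
```

Activations and the generation-time KV cache add more on top of this weights-and-optimizer total, which is how the overall footprint reaches figures like the ~360 GB cited above.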
The PPO Training Loop
For each batch of prompts:
1. Generate: Policy model generates responses
2. Score: Reward model scores each response
3. Compute advantage: KL-penalized reward minus the value model's estimate (reward - value_estimate - KL_penalty)
4. Update policy: PPO clipped objective (prevent large updates)
5. Update value model: Fit to observed rewards
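The five steps above can be sketched as a single training iteration. Every model here is passed in as a plain function, and all names are hypothetical placeholders rather than a real library API:

```python
# Sketch of one PPO iteration over a batch of prompts.
# All callables (generate, reward_fn, ...) are assumed interfaces.
def ppo_step(prompts, generate, reward_fn, ref_logp, value_fn,
             update_policy, update_value, beta=0.1):
    # 1. Generate: sample a response (and its log-prob) per prompt.
    responses, logp_old = generate(prompts)
    # 2. Score: frozen reward model rates each (prompt, response) pair.
    rewards = [reward_fn(p, r) for p, r in zip(prompts, responses)]
    # 3. Advantage: KL-penalized reward minus the value estimate.
    kls = [lo - ref_logp(p, r)
           for lo, p, r in zip(logp_old, prompts, responses)]
    penalized = [rw - beta * k for rw, k in zip(rewards, kls)]
    values = [value_fn(p, r) for p, r in zip(prompts, responses)]
    advantages = [pr - v for pr, v in zip(penalized, values)]
    # 4. Update policy with the clipped PPO objective.
    update_policy(prompts, responses, logp_old, advantages)
    # 5. Update value model toward the observed penalized rewards.
    update_value(prompts, responses, penalized)
    return advantages
```

Note that the KL penalty is folded into the reward before the advantage is computed, so both the policy update and the value target see the penalized quantity.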
KL penalty: reward_final = reward - beta * KL(policy || reference)
Typical beta: 0.01-0.2. A higher beta is more conservative: the policy stays closer to the SFT model.
```python
import math

# PPO objective (simplified; per-response scalars for clarity)
def ppo_loss(reward, log_prob_policy, log_prob_reference,
             log_prob_new, log_prob_old, advantage, beta=0.1, eps=0.2):
    # KL-penalized reward: discourage drift from the reference (SFT) model
    kl = log_prob_policy - log_prob_reference
    penalized_reward = reward - beta * kl
    # PPO clipped objective: bound the size of each policy update
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(1 - eps, min(ratio, 1 + eps))
    loss = -min(ratio * advantage, clipped * advantage)
    return penalized_reward, loss
```
PPO is notoriously unstable for LLMs. Common issues: reward hacking (model finds exploits in the reward model), mode collapse (model generates only one type of response), training instability (loss spikes). This complexity is a major motivation for simpler alternatives like DPO (Chapter 7).
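One practical consequence is that PPO runs are usually monitored closely. As an illustrative heuristic (not a standard diagnostic), reward hacking often shows up as reward climbing while the KL to the reference model explodes, meaning the policy is drifting far from the SFT model to exploit the reward model:

```python
# Toy monitor for one PPO failure signature: reward keeps improving
# while KL from the reference model exceeds a chosen limit.
# The threshold and the rule itself are illustrative assumptions.
def flag_reward_hacking(rewards, kls, kl_limit=10.0):
    """Return True if reward rose over the run while KL blew past kl_limit."""
    reward_rising = rewards[-1] > rewards[0]
    kl_exploding = kls[-1] > kl_limit
    return reward_rising and kl_exploding

print(flag_reward_hacking([1.0, 2.0, 5.0], [0.5, 4.0, 30.0]))  # True
```

In practice this is one reason the KL penalty coefficient beta matters so much: it is the main knob that keeps such drift in check.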