The Analogy
Now we have a critic (reward model). PPO (Proximal Policy Optimization) is the training process: the model generates a response, the reward model scores it, and the model adjusts to produce higher-scoring responses. It’s like a student writing essays, getting grades from an AI teacher, and improving based on the feedback. The “proximal” part means: don’t change too much at once — stay close to the SFT model to avoid catastrophic forgetting.
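The KL penalty mentioned above can be made concrete with a toy example. This is a pure-Python sketch, not any library's API; all probabilities and the β value are invented for illustration:

```python
import math

# Hypothetical next-token distributions over a 3-token vocabulary:
policy_probs    = [0.70, 0.20, 0.10]  # policy after some RL updates
reference_probs = [0.50, 0.30, 0.20]  # frozen SFT reference copy

# KL divergence KL(policy || reference): grows as the policy drifts away.
kl = sum(p * math.log(p / q) for p, q in zip(policy_probs, reference_probs))

beta = 0.05            # drift-penalty strength (assumed value)
raw_reward = 1.0       # hypothetical reward-model score
adjusted_reward = raw_reward - beta * kl  # drifting costs reward
```

The further the policy drifts from the reference, the larger the KL term and the bigger the bite taken out of the reward, which is exactly the "stay close to the SFT model" pressure.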
Key insight: RLHF with PPO requires running four models simultaneously: (1) the policy model being trained, (2) a reference model (frozen SFT copy, for KL penalty), (3) the reward model, and (4) a value model (critic). For a 70B model, that’s ~280B parameters in memory. This extreme resource requirement is why DPO became popular — it eliminates models 3 and 4.
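A back-of-envelope sketch of that memory math, under common (but assumed) settings: fp16 weights for all four models, and mixed-precision Adam (fp32 master weights plus two fp32 moments) for the two trained models only:

```python
# Rough memory estimate for the four-model PPO setup at 70B scale.
PARAMS = 70e9        # parameters per model
BYTES_FP16 = 2       # bytes per fp16 weight

# All four models hold weights in memory.
weights_gb = 4 * PARAMS * BYTES_FP16 / 1e9   # 560 GB of weights alone

# Only policy + value are trained; Adam adds ~12 bytes/param
# (fp32 master weights + two fp32 moments).
optimizer_gb = 2 * PARAMS * 12 / 1e9         # 1,680 GB of optimizer state
```

Even before activations and KV caches, the weights alone exceed any single accelerator, which is why PPO-based RLHF at this scale demands multi-node sharding and why dropping two of the four models (as DPO does) is such a large win.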
The PPO Loop
# RLHF with PPO (simplified pseudocode):
for prompt in prompts:
    # 1. Generate a response with the current policy
    response = policy_model.generate(prompt)

    # 2. Score the full sequence with the reward model
    reward = reward_model(prompt + response)

    # 3. KL penalty: don't drift too far from the frozen SFT reference
    kl = kl_divergence(policy_model, ref_model, prompt, response)
    adjusted_reward = reward - β * kl   # β ≈ 0.01–0.1 (controls drift)

    # 4. PPO clipped update
    advantage = adjusted_reward - value_model(prompt, response)
    ratio = torch.exp(logprobs_new - logprobs_old)   # π_new / π_old
    clipped = torch.clamp(ratio, 1 - ε, 1 + ε)       # ε ≈ 0.2
    loss = -torch.min(ratio * advantage, clipped * advantage).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Models in memory simultaneously (8B example):
#   1. Policy (being trained):  8B
#   2. Reference (frozen):      8B
#   3. Reward model:            8B
#   4. Value model:             8B
#   Total: ~32B params in memory!
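To make the clipping step concrete, here is the same objective evaluated by hand for a single action, in pure Python with made-up numbers:

```python
# Worked example of the clipped surrogate objective for one action.
eps = 0.2            # PPO clip range (a common default)
advantage = 2.0      # positive: the action was better than the critic expected
ratio = 1.5          # new policy is 50% more likely to take this action

# Clamp the ratio into [1 - eps, 1 + eps] = [0.8, 1.2]
clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)

# Taking the min caps the incentive: even with a large ratio and a
# positive advantage, the objective cannot exceed the clipped version.
objective = min(ratio * advantage, clipped_ratio * advantage)
loss = -objective
```

With these numbers the unclipped term would be 3.0, but clipping caps the objective at 2.4, so the gradient stops pushing the policy further once the ratio leaves the trust region: that is the "proximal" in PPO.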