RL with Process Rewards
// Outcome vs Process supervision
Outcome Supervision:
Solution: [S1, S2, S3, S4, S5]
Final answer: correct
Reward: [?, ?, ?, ?, +1]
// Which steps were good? Unknown!
Solution: [S1, S2, S3, S4, S5]
Final answer: wrong
Reward: [?, ?, ?, ?, -1]
// S1-S3 might be correct but
// all get penalized equally
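The outcome case above can be sketched in a few lines. This is a minimal illustration, not a real training loop; the function name and signature are hypothetical.

```python
def outcome_rewards(steps, final_answer_correct):
    """Outcome supervision: one scalar reward from the final answer,
    inherited by every step. Individual step quality is unobserved
    (the '?' entries in the notes above)."""
    r = 1.0 if final_answer_correct else -1.0
    return [r] * len(steps)

# A wrong final answer penalizes all five steps equally,
# even if S1-S3 were correct:
outcome_rewards(["S1", "S2", "S3", "S4", "S5"], False)
# -> [-1.0, -1.0, -1.0, -1.0, -1.0]
```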
Process Supervision:
Solution: [S1, S2, S3, S4, S5]
PRM scores: [+1, +1, +1, -1, -1]
// Error at S4! Reinforce S1-S3,
// penalize S4-S5
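The process case can be sketched similarly. A sketch under the assumption that the PRM emits a per-step correctness probability and that an error invalidates every later step; the helper name and threshold are illustrative, not a standard API.

```python
def prm_step_rewards(step_probs, threshold=0.5):
    """Process supervision: convert PRM step-correctness probabilities
    into +1/-1 rewards. Once a step falls below the threshold, it and
    all subsequent steps are penalized, since reasoning built on an
    erroneous step is unreliable."""
    rewards, errored = [], False
    for p in step_probs:
        errored = errored or p < threshold
        rewards.append(-1.0 if errored else 1.0)
    return rewards

# Error first detected at S4: reinforce S1-S3, penalize S4-S5.
prm_step_rewards([0.9, 0.8, 0.95, 0.2, 0.7])
# -> [1.0, 1.0, 1.0, -1.0, -1.0]
```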
RL Efficiency:
Outcome: ~100K episodes to converge
Process: ~20K episodes to converge
// ~5x more sample-efficient (illustrative figures)
Alignment Benefit:
Outcome: may reward flawed reasoning
that accidentally reaches the right answer
Process: rewards each correct reasoning step,
even when the final answer is wrong
// Aligned with human-endorsed reasoning