Ch 6 — RLHF & Reward Models — Under the Hood

Preference data format, reward model training code, PPO with TRL, and practical RLHF recipes
A. Preference Data Format & Loading: how preference datasets are structured for training
- Data Format: each example is a prompt paired with a chosen and a rejected response (prompt/chosen/rejected fields).
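A preference example can be represented as a plain record; a minimal sketch (field names follow the common prompt/chosen/rejected convention, and the example text is illustrative):

```python
# One preference example: a prompt plus a preferred (chosen) and a
# dispreferred (rejected) completion. A dataset is a list of such records.
example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes training data and fails to generalize.",
    "rejected": "Overfitting is good because the training loss goes down.",
}

def validate(record):
    """Check that a record has the three required string fields."""
    return all(isinstance(record.get(k), str) for k in ("prompt", "chosen", "rejected"))

print(validate(example))  # True
```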
- Preprocessing: tokenize the chosen and rejected pairs (prompt plus response concatenated for each side).
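Preprocessing turns each record into two token sequences, prompt+chosen and prompt+rejected. A toy whitespace tokenizer stands in for a real subword tokenizer here, just to show the shape of the transform:

```python
def toy_tokenize(text):
    # Stand-in for a real subword tokenizer: split on whitespace.
    return text.split()

def preprocess(record):
    """Build the two sequences the reward model will score."""
    return {
        "chosen_ids": toy_tokenize(record["prompt"] + " " + record["chosen"]),
        "rejected_ids": toy_tokenize(record["prompt"] + " " + record["rejected"]),
    }

out = preprocess({"prompt": "Hi", "chosen": "hello there", "rejected": "go away"})
print(out["chosen_ids"])  # ['Hi', 'hello', 'there']
```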
- Bradley-Terry model: P(y_w > y_l) = sigmoid(r(y_w) - r(y_l)), the probability that the winning response y_w is preferred over the losing response y_l, given reward scores r(·).
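The Bradley-Terry probability above is just a sigmoid of the reward gap; a direct sketch:

```python
import math

def bt_probability(r_w, r_l):
    """P(y_w > y_l) under Bradley-Terry: sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(bt_probability(1.0, 1.0))  # 0.5: equal rewards -> coin flip
print(bt_probability(3.0, 1.0))  # ~0.88: higher reward -> likely preferred
```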
B. Reward Model Training: architecture, loss function, and training with TRL
- RM Architecture: a base LLM with a scalar value head that maps the final hidden state to a single reward score.
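The value head is just a linear layer from the final hidden state to a scalar; a minimal pure-Python sketch of that mapping (the hidden state and weights are random stand-ins for a real model's):

```python
import random

def value_head(hidden_state, weights, bias=0.0):
    """Scalar reward: dot product of the final hidden state with learned weights."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias

random.seed(0)
hidden_dim = 8
hidden = [random.gauss(0, 1) for _ in range(hidden_dim)]   # stand-in for LLM output
weights = [random.gauss(0, 1) for _ in range(hidden_dim)]  # learned head parameters
score = value_head(hidden, weights)
print(type(score))  # one float per sequence: its reward score
```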
- RewardTrainer: TRL's RewardTrainer handles batching and the pairwise loss for reward model training.
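The pairwise loss that reward model training minimizes follows from Bradley-Terry: -log sigmoid(r_chosen - r_rejected). A pure-Python sketch of that loss (an illustration of the math, not TRL's actual code):

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when chosen outscores rejected."""
    margin = r_chosen - r_rejected
    # log(sigmoid(x)) = -log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

print(round(pairwise_loss(2.0, 0.0), 3))  # 0.127: correct ordering, small loss
print(round(pairwise_loss(0.0, 2.0), 3))  # 2.127: wrong ordering, large loss
```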
- RM Evaluation: accuracy on held-out pairs, i.e. how often the model scores the chosen response above the rejected one.
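Reward model accuracy is the fraction of held-out pairs ranked correctly; a minimal sketch (the scores are illustrative):

```python
def rm_accuracy(score_pairs):
    """Fraction of (chosen_score, rejected_score) pairs where chosen wins."""
    correct = sum(1 for c, r in score_pairs if c > r)
    return correct / len(score_pairs)

pairs = [(1.2, 0.3), (0.1, 0.9), (2.0, 1.5), (0.4, 0.4)]
print(rm_accuracy(pairs))  # 0.5: two of four pairs ranked correctly
```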
The PPO loop: generate → score → compute advantage → update policy.
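The loop above can be sketched with stub functions standing in for the real models (all names here are placeholders, not TRL APIs):

```python
def ppo_loop(prompts, generate, score, compute_advantage, update_policy, steps=1):
    """One pass of the RLHF PPO loop: generate -> score -> advantage -> update."""
    for _ in range(steps):
        responses = [generate(p) for p in prompts]                    # policy samples
        rewards = [score(p, r) for p, r in zip(prompts, responses)]   # RM scores
        advantages = compute_advantage(rewards)                       # e.g. GAE in practice
        update_policy(prompts, responses, advantages)                 # PPO update
    return rewards

# Tiny deterministic stubs, just to exercise the loop shape.
log = []
rewards = ppo_loop(
    ["hi"],
    generate=lambda p: p + "!",
    score=lambda p, r: float(len(r)),
    compute_advantage=lambda rs: [r - sum(rs) / len(rs) for r in rs],
    update_policy=lambda *a: log.append("update"),
)
print(rewards, log)  # [3.0] ['update']
```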
C. PPO Training with TRL: the four-model setup and training loop
- Policy Model: the actively trained model; it generates responses and is penalized by a KL divergence from the reference model.
- Reference Model: a frozen copy of the SFT model that anchors the KL penalty.
- Reward Model: frozen; scores the policy's generations.
- Value Model: the critic that estimates expected return for advantage computation (in TRL, typically a value head attached to the policy); this is the fourth model of the setup.
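The KL penalty is applied per token by comparing policy and reference log-probs, with the reward model's scalar score added at the final token. A sketch of that reward shaping (β and the log-probs are illustrative):

```python
def shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token reward: -beta * (logp_policy - logp_ref), with the reward
    model's scalar score added at the final token of the response."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

r = shaped_rewards([-1.0, -2.0], [-1.5, -2.0], rm_score=1.0, beta=0.1)
print(r)  # [-0.05, 1.0]: penalized where the policy drifts from the reference
```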
D. PPO Implementation Code: complete PPO training script with TRL PPOTrainer
- PPO Script: PPOTrainer setup, generation, scoring, and optimization steps.
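Under the hood, PPOTrainer optimizes the clipped surrogate objective; a pure-Python sketch of that core computation for a single action (an illustration of the math, not TRL's actual code):

```python
import math

def ppo_policy_loss(logp_new, logp_old, advantage, clip_range=0.2):
    """Clipped surrogate loss for one action (negated, so lower is better)."""
    ratio = math.exp(logp_new - logp_old)           # new policy / old policy
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_range), 1 - clip_range) * advantage
    return -min(unclipped, clipped)                 # pessimistic of the two

print(ppo_policy_loss(-1.0, -1.0, advantage=1.0))           # -1.0: ratio 1, no clip
print(round(ppo_policy_loss(-0.5, -1.0, advantage=1.0), 2)) # -1.2: ratio ~1.65, clipped
```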
- Hyperparameters: the KL coefficient, clip range, and batch sizes are the main knobs.
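Typical starting points for those knobs, expressed as a plain config dict (the values are illustrative, not prescriptive; the names mirror classic TRL PPOConfig fields, which may differ across TRL versions):

```python
# Illustrative PPO hyperparameters for RLHF fine-tuning.
ppo_hyperparams = {
    "init_kl_coef": 0.2,      # strength of the KL penalty against the reference model
    "cliprange": 0.2,         # PPO clip range for the policy ratio
    "batch_size": 64,         # prompts collected per PPO step
    "mini_batch_size": 8,     # optimization granularity within a step
    "learning_rate": 1.4e-5,  # small LR: RLHF updates should stay near the SFT model
}
```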
E. Constitutional AI & RLAIF: using AI feedback instead of human feedback
- CAI Pipeline: the model critiques its own responses against a constitution, then revises them; the revised responses become training data.
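The critique-and-revise step can be sketched as two prompted generations; `generate` here is a hypothetical stand-in for an LLM call, and the prompt wording is illustrative:

```python
def critique_and_revise(prompt, response, principle, generate):
    """One CAI step: critique a response against a principle, then revise it."""
    critique = generate(
        f"Response: {response}\nCritique this response against the principle: {principle}"
    )
    revision = generate(
        f"Response: {response}\nCritique: {critique}\nRewrite the response to address the critique."
    )
    return revision  # revised responses become the next round's training data

# Deterministic stub LLM, just so the flow is runnable end to end.
stub = lambda text: "REVISED" if "Rewrite" in text else "CRITIQUE"
print(critique_and_revise("q", "a", "be harmless", stub))  # REVISED
```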
- LLM-as-Judge: an LLM compares response pairs and produces AI preference labels, replacing human annotators (RLAIF).
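RLAIF replaces the human annotator with a judge model that picks between two responses; a sketch where `judge` is a hypothetical LLM call expected to answer "A" or "B" (the prompt wording is illustrative):

```python
def ai_preference_label(prompt, response_a, response_b, judge):
    """Build a prompt/chosen/rejected record from an AI judge's verdict."""
    verdict = judge(
        f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
        "Which response is better? Answer A or B."
    )
    chosen, rejected = (
        (response_a, response_b) if verdict.strip() == "A" else (response_b, response_a)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = ai_preference_label("2+2?", "4", "5", judge=lambda text: "A")
print(record["chosen"])  # 4
```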