Ch 6 — RLHF & Reward Models — Under the Hood

Preference data format, reward model training code, PPO with TRL, and practical RLHF recipes
A. Preference Data Format & Loading: how preference datasets are structured for training
- Data Format: each example is a prompt paired with a chosen and a rejected response (prompt/chosen/rejected fields).
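A preference example can be represented as a plain record; a minimal sketch (field names follow the common prompt/chosen/rejected convention, and the example text is illustrative):

```python
# One preference example: a prompt plus a preferred (chosen) and a
# dispreferred (rejected) completion. A dataset is a list of such records.
example = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes training data and fails to generalize.",
    "rejected": "Overfitting is good because the training loss goes down.",
}

def validate(record):
    """Check that a record has the three required string fields."""
    return all(isinstance(record.get(k), str) for k in ("prompt", "chosen", "rejected"))

print(validate(example))  # True
```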
- Preprocessing: tokenize the chosen and rejected pairs (prompt plus response concatenated for each side).
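Preprocessing turns each record into two token sequences, prompt+chosen and prompt+rejected. A toy whitespace tokenizer stands in for a real subword tokenizer here, just to show the shape of the transform:

```python
def toy_tokenize(text):
    # Stand-in for a real subword tokenizer: split on whitespace.
    return text.split()

def preprocess(record):
    """Build the two sequences the reward model will score."""
    return {
        "chosen_ids": toy_tokenize(record["prompt"] + " " + record["chosen"]),
        "rejected_ids": toy_tokenize(record["prompt"] + " " + record["rejected"]),
    }

out = preprocess({"prompt": "Hi", "chosen": "hello there", "rejected": "go away"})
print(out["chosen_ids"])  # ['Hi', 'hello', 'there']
```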
- Bradley-Terry model: P(y_w > y_l) = sigmoid(r(y_w) - r(y_l)), the probability that the winning response y_w is preferred over the losing response y_l, given reward scores r(·).
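The Bradley-Terry probability above is just a sigmoid of the reward gap; a direct sketch:

```python
import math

def bt_probability(r_w, r_l):
    """P(y_w > y_l) under Bradley-Terry: sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(bt_probability(1.0, 1.0))  # 0.5: equal rewards -> coin flip
print(bt_probability(3.0, 1.0))  # ~0.88: higher reward -> likely preferred
```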
B. Reward Model Training: architecture, loss function, and training with TRL
- RM Architecture: a base LLM with a scalar value head that maps the final hidden state to a single reward score.
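The value head is just a linear layer from the final hidden state to a scalar; a minimal pure-Python sketch of that mapping (the hidden state and weights are random stand-ins for a real model's):

```python
import random

def value_head(hidden_state, weights, bias=0.0):
    """Scalar reward: dot product of the final hidden state with learned weights."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias

random.seed(0)
hidden_dim = 8
hidden = [random.gauss(0, 1) for _ in range(hidden_dim)]   # stand-in for LLM output
weights = [random.gauss(0, 1) for _ in range(hidden_dim)]  # learned head parameters
score = value_head(hidden, weights)
print(type(score))  # one float per sequence: its reward score
```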
- RewardTrainer: TRL's RewardTrainer handles batching and the pairwise loss for reward model training.
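The pairwise loss that reward model training minimizes follows from Bradley-Terry: -log sigmoid(r_chosen - r_rejected). A pure-Python sketch of that loss (an illustration of the math, not TRL's actual code):

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): small when chosen outscores rejected."""
    margin = r_chosen - r_rejected
    # log(sigmoid(x)) = -log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

print(round(pairwise_loss(2.0, 0.0), 3))  # 0.127: correct ordering, small loss
print(round(pairwise_loss(0.0, 2.0), 3))  # 2.127: wrong ordering, large loss
```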
- RM Evaluation: accuracy on held-out pairs, i.e. how often the model scores the chosen response above the rejected one.
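Reward model accuracy is the fraction of held-out pairs ranked correctly; a minimal sketch (the scores are illustrative):

```python
def rm_accuracy(score_pairs):
    """Fraction of (chosen_score, rejected_score) pairs where chosen wins."""
    correct = sum(1 for c, r in score_pairs if c > r)
    return correct / len(score_pairs)

pairs = [(1.2, 0.3), (0.1, 0.9), (2.0, 1.5), (0.4, 0.4)]
print(rm_accuracy(pairs))  # 0.5: two of four pairs ranked correctly
```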
The PPO loop: generate → score → compute advantage → update policy.
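The loop above can be sketched with stub functions standing in for the real models (all names here are placeholders, not TRL APIs):

```python
def ppo_loop(prompts, generate, score, compute_advantage, update_policy, steps=1):
    """One pass of the RLHF PPO loop: generate -> score -> advantage -> update."""
    for _ in range(steps):
        responses = [generate(p) for p in prompts]                    # policy samples
        rewards = [score(p, r) for p, r in zip(prompts, responses)]   # RM scores
        advantages = compute_advantage(rewards)                       # e.g. GAE in practice
        update_policy(prompts, responses, advantages)                 # PPO update
    return rewards

# Tiny deterministic stubs, just to exercise the loop shape.
log = []
rewards = ppo_loop(
    ["hi"],
    generate=lambda p: p + "!",
    score=lambda p, r: float(len(r)),
    compute_advantage=lambda rs: [r - sum(rs) / len(rs) for r in rs],
    update_policy=lambda *a: log.append("update"),
)
print(rewards, log)  # [3.0] ['update']
```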
C. PPO Training with TRL: the four-model setup and training loop
- Policy Model: the actively trained model; it generates responses and is penalized by a KL divergence from the reference model.
- Reference Model: a frozen copy of the SFT model that anchors the KL penalty.
- Reward Model: frozen; scores the policy's generations.
- Value Model: the critic that estimates expected return for advantage computation (in TRL, typically a value head attached to the policy); this is the fourth model of the setup.
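The KL penalty is applied per token by comparing policy and reference log-probs, with the reward model's scalar score added at the final token. A sketch of that reward shaping (β and the log-probs are illustrative):

```python
def shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token reward: -beta * (logp_policy - logp_ref), with the reward
    model's scalar score added at the final token of the response."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

r = shaped_rewards([-1.0, -2.0], [-1.5, -2.0], rm_score=1.0, beta=0.1)
print(r)  # [-0.05, 1.0]: penalized where the policy drifts from the reference
```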
D. PPO Implementation Code: complete PPO training script with TRL PPOTrainer
- PPO Script: PPOTrainer setup, generation, scoring, and optimization steps.
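Under the hood, PPOTrainer optimizes the clipped surrogate objective; a pure-Python sketch of that core computation for a single action (an illustration of the math, not TRL's actual code):

```python
import math

def ppo_policy_loss(logp_new, logp_old, advantage, clip_range=0.2):
    """Clipped surrogate loss for one action (negated, so lower is better)."""
    ratio = math.exp(logp_new - logp_old)           # new policy / old policy
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + clip_range), 1 - clip_range) * advantage
    return -min(unclipped, clipped)                 # pessimistic of the two

print(ppo_policy_loss(-1.0, -1.0, advantage=1.0))           # -1.0: ratio 1, no clip
print(round(ppo_policy_loss(-0.5, -1.0, advantage=1.0), 2)) # -1.2: ratio ~1.65, clipped
```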
- Hyperparameters: the KL coefficient, clip range, and batch sizes are the main knobs.
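Typical starting points for those knobs, expressed as a plain config dict (the values are illustrative, not prescriptive; the names mirror classic TRL PPOConfig fields, which may differ across TRL versions):

```python
# Illustrative PPO hyperparameters for RLHF fine-tuning.
ppo_hyperparams = {
    "init_kl_coef": 0.2,      # strength of the KL penalty against the reference model
    "cliprange": 0.2,         # PPO clip range for the policy ratio
    "batch_size": 64,         # prompts collected per PPO step
    "mini_batch_size": 8,     # optimization granularity within a step
    "learning_rate": 1.4e-5,  # small LR: RLHF updates should stay near the SFT model
}
```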
E. Constitutional AI & RLAIF: using AI feedback instead of human feedback
- CAI Pipeline: the model critiques its own responses against a constitution, then revises them; the revised responses become training data.
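The critique-and-revise step can be sketched as two prompted generations; `generate` here is a hypothetical stand-in for an LLM call, and the prompt wording is illustrative:

```python
def critique_and_revise(prompt, response, principle, generate):
    """One CAI step: critique a response against a principle, then revise it."""
    critique = generate(
        f"Response: {response}\nCritique this response against the principle: {principle}"
    )
    revision = generate(
        f"Response: {response}\nCritique: {critique}\nRewrite the response to address the critique."
    )
    return revision  # revised responses become the next round's training data

# Deterministic stub LLM, just so the flow is runnable end to end.
stub = lambda text: "REVISED" if "Rewrite" in text else "CRITIQUE"
print(critique_and_revise("q", "a", "be harmless", stub))  # REVISED
```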
- LLM-as-Judge: an LLM compares response pairs and produces AI preference labels, replacing human annotators (RLAIF).
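RLAIF replaces the human annotator with a judge model that picks between two responses; a sketch where `judge` is a hypothetical LLM call expected to answer "A" or "B" (the prompt wording is illustrative):

```python
def ai_preference_label(prompt, response_a, response_b, judge):
    """Build a prompt/chosen/rejected record from an AI judge's verdict."""
    verdict = judge(
        f"Prompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
        "Which response is better? Answer A or B."
    )
    chosen, rejected = (
        (response_a, response_b) if verdict.strip() == "A" else (response_b, response_a)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = ai_preference_label("2+2?", "4", "5", judge=lambda text: "A")
print(record["chosen"])  # 4
```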