Ch 7 — DPO, ORPO & Modern Alignment — Under the Hood

DPO loss implementation, DPOTrainer, ORPOTrainer, KTO code, and practical training recipes
A. DPO Loss & Implementation: the math and code behind Direct Preference Optimization

1. DPO Loss: implemented from scratch and with TRL
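Before reaching for TRL, it helps to see the DPO objective itself. For a preference pair (chosen yw, rejected yl), DPO minimizes -log σ(β[(log πθ(yw) − log πref(yw)) − (log πθ(yl) − log πref(yl))]). Below is a minimal from-scratch sketch for a single pair, using only the standard library; the function name and argument names are mine, and in a real model each log-probability would be the summed token log-probs of the full response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the full response
    under the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy's log-ratio on the chosen response grows relative to the rejected one; beta controls how hard the policy is pushed away from the reference.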
2. DPOTrainer: complete training script
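A complete DPOTrainer run is mostly configuration. The sketch below follows the shape of recent TRL releases; the model name is a placeholder, the dataset is one example of a hub dataset with "prompt"/"chosen"/"rejected" columns, and exact argument names vary between TRL versions, so treat this as a template rather than a drop-in script.

```python
# Template for a full DPOTrainer run with TRL (argument names vary by version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                       # strength of the KL-anchoring term
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,          # TRL clones the policy as the frozen reference
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```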
3. Hyperparameters: beta, learning rate, epochs
4. Monitoring: chosen/rejected rewards, accuracy, margin, loss curves
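The monitored quantities (chosen/rejected rewards, margin, accuracy) all fall out of the same policy/reference log-ratios used by the loss. A stdlib sketch, with function and key names of my choosing that mirror the metric names TRL logs:

```python
def dpo_metrics(pairs, beta=0.1):
    """Standard DPO training metrics for a batch.

    `pairs` is a list of tuples
    (policy_chosen_logp, policy_rejected_logp, ref_chosen_logp, ref_rejected_logp).
    """
    chosen_rewards, rejected_rewards = [], []
    for pc, pr, rc, rr in pairs:
        chosen_rewards.append(beta * (pc - rc))    # implicit reward of chosen
        rejected_rewards.append(beta * (pr - rr))  # implicit reward of rejected
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    n = len(pairs)
    return {
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/margin": sum(margins) / n,
        "rewards/accuracy": sum(m > 0 for m in margins) / n,
    }
```

Healthy training typically shows the margin and accuracy rising while both raw rewards drift downward; an accuracy stuck near 0.5 suggests the model is not separating the pairs.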
B. ORPO Implementation: one-step alignment without a reference model

5. ORPO Loss: SFT loss plus an odds-ratio term, with TRL
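ORPO combines a plain SFT (NLL) loss on the chosen response with an odds-ratio penalty, so no reference model is needed: L = L_SFT + λ · L_OR, where odds(y) = p / (1 − p) with p = exp(average token log-prob of y). A minimal stdlib sketch for one pair, with names and the example λ chosen by me:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """ORPO loss for one pair, from length-averaged token log-probs."""
    def log_odds(avg_logp):
        p = math.exp(avg_logp)                 # length-averaged sequence prob
        return math.log(p) - math.log(1.0 - p)
    sft_loss = -chosen_avg_logp                # plain NLL on the chosen answer
    ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    or_loss = -math.log(sigmoid(ratio))        # odds-ratio preference penalty
    return sft_loss + lam * or_loss
```

Because L_SFT dominates, ORPO behaves like supervised fine-tuning with a mild preference nudge; this is why it can replace the separate SFT + DPO pipeline with a single stage.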
6. ORPOTrainer: complete training script
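With TRL, the ORPO script looks like the DPO one minus the reference model. A template in the shape of recent TRL releases; the model name is a placeholder and argument names may differ between versions (in TRL's ORPOConfig, the λ weight is exposed as `beta`):

```python
# ORPO needs no reference model and no separate SFT stage: one trainer call.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = ORPOConfig(
    output_dir="orpo-out",
    beta=0.1,                  # the odds-ratio weight (lambda in the paper)
    learning_rate=8e-6,        # higher than DPO, since ORPO also does SFT
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = ORPOTrainer(
    model=model,               # no ref_model argument: ORPO doesn't need one
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```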
Note: SimPO likewise needs no reference model; it uses length-normalized rewards instead.
C. KTO & SimPO Code: binary feedback and reference-free alignment

7. SimPO Script: no reference model
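SimPO replaces DPO's reference log-ratios with the policy's length-normalized log-probability and adds a target margin γ: the loss is -log σ((β/|yw|) log πθ(yw) − (β/|yl|) log πθ(yl) − γ). A stdlib sketch for one pair, with names and the example β, γ values chosen by me:

```python
import math

def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len,
               beta=2.0, gamma=0.5):
    """SimPO loss for one pair: length-normalized policy log-probs,
    a target margin gamma, and no reference model at all."""
    chosen_reward = beta * chosen_logp / chosen_len      # avg per-token logp
    rejected_reward = beta * rejected_logp / rejected_len
    margin = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(margin)), numerically stable
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The length normalization is the point: without it, a verbose chosen response accumulates more total log-probability simply by being longer, which is one of the length biases SimPO targets.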
8. KTO Script: unpaired (binary) feedback
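KTO works on unpaired thumbs-up/thumbs-down examples rather than preference pairs. Each example gets the DPO-style implicit reward r = β(log πθ(y) − log πref(y)) and is penalized relative to a reference point z_ref (in KTO, an estimate of the policy-reference KL). A simplified stdlib sketch with z_ref fixed at 0; the names and weights are mine:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(policy_logp, ref_logp, desirable, z_ref=0.0,
             beta=0.1, lam_d=1.0, lam_u=1.0):
    """Simplified KTO loss for one unpaired example.

    desirable=True for thumbs-up data, False for thumbs-down;
    z_ref is the batch KL reference point (fixed at 0 here for clarity).
    """
    r = beta * (policy_logp - ref_logp)   # implicit reward, as in DPO
    if desirable:
        return lam_d * (1.0 - sigmoid(r - z_ref))   # reward good outputs
    return lam_u * (1.0 - sigmoid(z_ref - r))       # punish bad outputs
```

The asymmetric λ_d / λ_u weights let you rebalance datasets where thumbs-up and thumbs-down examples are unevenly represented.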
D. DPO + LoRA, the Practical Recipe: memory-efficient alignment with QLoRA

9. DPO + QLoRA: single-GPU recipe
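The memory win of combining DPO with QLoRA is twofold: the base model is loaded in 4-bit, and when a LoRA config is passed to the trainer, TRL can score the reference policy by simply disabling the adapters, so no second full model is held in memory. A single-GPU template; model name, dataset, and LoRA ranks are illustrative, and argument names vary by library version:

```python
# Single-GPU DPO with a 4-bit base model and LoRA adapters (QLoRA-style).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
config = DPOConfig(output_dir="dpo-qlora-out", beta=0.1, learning_rate=5e-6,
                   per_device_train_batch_size=1,
                   gradient_accumulation_steps=16)

# With peft_config set, the adapter-disabled model serves as the reference,
# so ref_model can stay None without a second copy in memory.
trainer = DPOTrainer(model=model, ref_model=None, args=config,
                     train_dataset=dataset, processing_class=tokenizer,
                     peft_config=peft_config)
trainer.train()
```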
E. Iterative DPO & Online DPO: bridging the gap between offline DPO and online PPO

10. Iterative DPO: a generate-and-re-train loop
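Iterative DPO closes part of the offline/online gap by repeating a loop: sample responses from the current policy, rank them with a reward model or judge, turn best-vs-worst into fresh preference pairs, and run another round of DPO on them. The runnable toy below shows only the loop's control flow; `generate`, `score`, and `train_dpo` are stand-ins I invented for the real policy, judge, and DPOTrainer run.

```python
import random

# Toy skeleton of the iterative-DPO loop; the three helpers are stand-ins.
def generate(policy, prompt, n=4):
    return [f"{prompt} :: sample {i} (policy v{policy})" for i in range(n)]

def score(response):
    return random.random()   # stand-in for a reward model or LLM judge

def train_dpo(policy, pairs):
    return policy + 1        # stand-in: returns the "updated" policy

def iterative_dpo(prompts, rounds=3, seed=0):
    random.seed(seed)
    policy = 0
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            candidates = generate(policy, prompt)
            ranked = sorted(candidates, key=score, reverse=True)
            # best vs. worst sample becomes a new preference pair
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0],
                          "rejected": ranked[-1]})
        policy = train_dpo(policy, pairs)   # re-train on fresh pairs
    return policy, pairs
```

Because each round's pairs come from the *current* policy rather than a fixed offline dataset, the preference data stays on-distribution, which is the main advantage online methods like PPO have over vanilla DPO.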
11. Online DPO: TRL's OnlineDPOTrainer
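TRL's OnlineDPOTrainer folds the generate-score-train loop into the trainer itself: it samples completions from the current policy during training and labels them with a judge. The dataset therefore contains prompts only. This API is newer and has changed between TRL releases, so the sketch below (placeholder model, example prompt dataset, TRL's PairRMJudge) is a rough template rather than a guaranteed-stable script:

```python
# Online DPO: the trainer generates and judges completions on the fly.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
judge = PairRMJudge()                       # pairwise judge shipped with TRL

# Prompt-only dataset: completions are sampled during training
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
config = OnlineDPOConfig(output_dir="online-dpo-out", learning_rate=5e-7)

trainer = OnlineDPOTrainer(
    model=model,
    judge=judge,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```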