Ch 7 — DPO, ORPO & Modern Alignment — Under the Hood

DPO loss implementation, DPOTrainer, ORPOTrainer, KTO code, and practical training recipes
A. DPO Loss & Implementation: the math and code behind Direct Preference Optimization

1. DPO Loss: implemented from scratch and with TRL
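Before reaching for TRL, it helps to see the DPO objective itself. For a preference pair (chosen yw, rejected yl), DPO minimizes -log σ(β[(log πθ(yw) − log πref(yw)) − (log πθ(yl) − log πref(yl))]). Below is a minimal from-scratch sketch for a single pair, using only the standard library; the function name and argument names are mine, and in a real model each log-probability would be the summed token log-probs of the full response.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of the full response
    under the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy's log-ratio on the chosen response grows relative to the rejected one; beta controls how hard the policy is pushed away from the reference.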
2. DPOTrainer: complete training script
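A complete DPOTrainer run is mostly configuration. The sketch below follows the shape of recent TRL releases; the model name is a placeholder, the dataset is one example of a hub dataset with "prompt"/"chosen"/"rejected" columns, and exact argument names vary between TRL versions, so treat this as a template rather than a drop-in script.

```python
# Template for a full DPOTrainer run with TRL (argument names vary by version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                       # strength of the KL-anchoring term
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,          # TRL clones the policy as the frozen reference
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```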
3. Hyperparameters: beta, learning rate, epochs
4. Monitoring: chosen/rejected rewards, accuracy, margin, loss curves
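The monitored quantities (chosen/rejected rewards, margin, accuracy) all fall out of the same policy/reference log-ratios used by the loss. A stdlib sketch, with function and key names of my choosing that mirror the metric names TRL logs:

```python
def dpo_metrics(pairs, beta=0.1):
    """Standard DPO training metrics for a batch.

    `pairs` is a list of tuples
    (policy_chosen_logp, policy_rejected_logp, ref_chosen_logp, ref_rejected_logp).
    """
    chosen_rewards, rejected_rewards = [], []
    for pc, pr, rc, rr in pairs:
        chosen_rewards.append(beta * (pc - rc))    # implicit reward of chosen
        rejected_rewards.append(beta * (pr - rr))  # implicit reward of rejected
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    n = len(pairs)
    return {
        "rewards/chosen": sum(chosen_rewards) / n,
        "rewards/rejected": sum(rejected_rewards) / n,
        "rewards/margin": sum(margins) / n,
        "rewards/accuracy": sum(m > 0 for m in margins) / n,
    }
```

Healthy training typically shows the margin and accuracy rising while both raw rewards drift downward; an accuracy stuck near 0.5 suggests the model is not separating the pairs.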
B. ORPO Implementation: one-step alignment without a reference model

5. ORPO Loss: SFT loss plus an odds-ratio term, with TRL
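ORPO combines a plain SFT (NLL) loss on the chosen response with an odds-ratio penalty, so no reference model is needed: L = L_SFT + λ · L_OR, where odds(y) = p / (1 − p) with p = exp(average token log-prob of y). A minimal stdlib sketch for one pair, with names and the example λ chosen by me:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """ORPO loss for one pair, from length-averaged token log-probs."""
    def log_odds(avg_logp):
        p = math.exp(avg_logp)                 # length-averaged sequence prob
        return math.log(p) - math.log(1.0 - p)
    sft_loss = -chosen_avg_logp                # plain NLL on the chosen answer
    ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    or_loss = -math.log(sigmoid(ratio))        # odds-ratio preference penalty
    return sft_loss + lam * or_loss
```

Because L_SFT dominates, ORPO behaves like supervised fine-tuning with a mild preference nudge; this is why it can replace the separate SFT + DPO pipeline with a single stage.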
6. ORPOTrainer: complete training script
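With TRL, the ORPO script looks like the DPO one minus the reference model. A template in the shape of recent TRL releases; the model name is a placeholder and argument names may differ between versions (in TRL's ORPOConfig, the λ weight is exposed as `beta`):

```python
# ORPO needs no reference model and no separate SFT stage: one trainer call.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = ORPOConfig(
    output_dir="orpo-out",
    beta=0.1,                  # the odds-ratio weight (lambda in the paper)
    learning_rate=8e-6,        # higher than DPO, since ORPO also does SFT
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = ORPOTrainer(
    model=model,               # no ref_model argument: ORPO doesn't need one
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```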
Note: SimPO likewise needs no reference model; it uses length-normalized rewards instead.
C. KTO & SimPO Code: binary feedback and reference-free alignment

7. SimPO Script: no reference model
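SimPO replaces DPO's reference log-ratios with the policy's length-normalized log-probability and adds a target margin γ: the loss is -log σ((β/|yw|) log πθ(yw) − (β/|yl|) log πθ(yl) − γ). A stdlib sketch for one pair, with names and the example β, γ values chosen by me:

```python
import math

def simpo_loss(chosen_logp, chosen_len, rejected_logp, rejected_len,
               beta=2.0, gamma=0.5):
    """SimPO loss for one pair: length-normalized policy log-probs,
    a target margin gamma, and no reference model at all."""
    chosen_reward = beta * chosen_logp / chosen_len      # avg per-token logp
    rejected_reward = beta * rejected_logp / rejected_len
    margin = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(margin)), numerically stable
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The length normalization is the point: without it, a verbose chosen response accumulates more total log-probability simply by being longer, which is one of the length biases SimPO targets.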
8. KTO Script: unpaired (binary) feedback
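KTO works on unpaired thumbs-up/thumbs-down examples rather than preference pairs. Each example gets the DPO-style implicit reward r = β(log πθ(y) − log πref(y)) and is penalized relative to a reference point z_ref (in KTO, an estimate of the policy-reference KL). A simplified stdlib sketch with z_ref fixed at 0; the names and weights are mine:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(policy_logp, ref_logp, desirable, z_ref=0.0,
             beta=0.1, lam_d=1.0, lam_u=1.0):
    """Simplified KTO loss for one unpaired example.

    desirable=True for thumbs-up data, False for thumbs-down;
    z_ref is the batch KL reference point (fixed at 0 here for clarity).
    """
    r = beta * (policy_logp - ref_logp)   # implicit reward, as in DPO
    if desirable:
        return lam_d * (1.0 - sigmoid(r - z_ref))   # reward good outputs
    return lam_u * (1.0 - sigmoid(z_ref - r))       # punish bad outputs
```

The asymmetric λ_d / λ_u weights let you rebalance datasets where thumbs-up and thumbs-down examples are unevenly represented.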
D. DPO + LoRA, the Practical Recipe: memory-efficient alignment with QLoRA

9. DPO + QLoRA: single-GPU recipe
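The memory win of combining DPO with QLoRA is twofold: the base model is loaded in 4-bit, and when a LoRA config is passed to the trainer, TRL can score the reference policy by simply disabling the adapters, so no second full model is held in memory. A single-GPU template; model name, dataset, and LoRA ranks are illustrative, and argument names vary by library version:

```python
# Single-GPU DPO with a 4-bit base model and LoRA adapters (QLoRA-style).
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
config = DPOConfig(output_dir="dpo-qlora-out", beta=0.1, learning_rate=5e-6,
                   per_device_train_batch_size=1,
                   gradient_accumulation_steps=16)

# With peft_config set, the adapter-disabled model serves as the reference,
# so ref_model can stay None without a second copy in memory.
trainer = DPOTrainer(model=model, ref_model=None, args=config,
                     train_dataset=dataset, processing_class=tokenizer,
                     peft_config=peft_config)
trainer.train()
```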
E. Iterative DPO & Online DPO: bridging the gap between offline DPO and online PPO

10. Iterative DPO: a generate-and-re-train loop
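Iterative DPO closes part of the offline/online gap by repeating a loop: sample responses from the current policy, rank them with a reward model or judge, turn best-vs-worst into fresh preference pairs, and run another round of DPO on them. The runnable toy below shows only the loop's control flow; `generate`, `score`, and `train_dpo` are stand-ins I invented for the real policy, judge, and DPOTrainer run.

```python
import random

# Toy skeleton of the iterative-DPO loop; the three helpers are stand-ins.
def generate(policy, prompt, n=4):
    return [f"{prompt} :: sample {i} (policy v{policy})" for i in range(n)]

def score(response):
    return random.random()   # stand-in for a reward model or LLM judge

def train_dpo(policy, pairs):
    return policy + 1        # stand-in: returns the "updated" policy

def iterative_dpo(prompts, rounds=3, seed=0):
    random.seed(seed)
    policy = 0
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            candidates = generate(policy, prompt)
            ranked = sorted(candidates, key=score, reverse=True)
            # best vs. worst sample becomes a new preference pair
            pairs.append({"prompt": prompt,
                          "chosen": ranked[0],
                          "rejected": ranked[-1]})
        policy = train_dpo(policy, pairs)   # re-train on fresh pairs
    return policy, pairs
```

Because each round's pairs come from the *current* policy rather than a fixed offline dataset, the preference data stays on-distribution, which is the main advantage online methods like PPO have over vanilla DPO.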
11. Online DPO: TRL's OnlineDPOTrainer
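TRL's OnlineDPOTrainer folds the generate-score-train loop into the trainer itself: it samples completions from the current policy during training and labels them with a judge. The dataset therefore contains prompts only. This API is newer and has changed between TRL releases, so the sketch below (placeholder model, example prompt dataset, TRL's PairRMJudge) is a rough template rather than a guaranteed-stable script:

```python
# Online DPO: the trainer generates and judges completions on the fly.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
judge = PairRMJudge()                       # pairwise judge shipped with TRL

# Prompt-only dataset: completions are sampled during training
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
config = OnlineDPOConfig(output_dir="online-dpo-out", learning_rate=5e-7)

trainer = OnlineDPOTrainer(
    model=model,
    judge=judge,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```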