Ch 7 — DPO, ORPO & Modern Alignment

Direct Preference Optimization, ORPO, SimPO, KTO, and the shift away from RLHF
The Key Insight Behind DPO
Why we can skip the reward model and RL entirely
The RLHF Problem
RLHF has three stages: SFT, reward model training, and PPO. The last two stages are complex, expensive, and unstable. PPO requires 4 models in memory and careful hyperparameter tuning.

The question: Can we skip the reward model and RL, and directly optimize the policy on preference data?
The DPO Insight
Rafailov et al. (2023, Stanford) showed that the optimal policy under the RLHF objective has a closed-form solution. The reward function can be expressed implicitly in terms of the policy itself:

r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)) + const

This means: instead of training a separate reward model and then optimizing against it with RL, you can directly optimize the policy to increase the probability of chosen responses and decrease the probability of rejected responses, relative to a reference model.

The reward model is implicit in the policy itself.
What This Eliminates
No reward model: Don't need to train a separate model to score responses.

No RL: No PPO, no value model, no advantage estimation, no clipping.

No generation during training: DPO works on pre-collected preference pairs. No need to generate responses during training (which is the slowest part of PPO).

Result: DPO reduces alignment from a complex 3-stage pipeline to a single supervised learning step on preference data.
DPO is to RLHF what SFT is to pre-training. Just as SFT simplified instruction-following from a complex RL problem to supervised learning, DPO simplifies alignment from RL to supervised learning. The key: finding the right loss function that implicitly captures what RLHF does explicitly.
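To make the implicit reward concrete, here is a minimal numeric sketch (the log-probabilities and the `implicit_reward` helper are illustrative, not from any library):

```python
def implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO's implicit reward: beta * log(pi(y|x) / pi_ref(y|x)).
    # The additive constant cancels when comparing two responses,
    # so it is omitted here.
    return beta * (logp_policy - logp_ref)

# Toy sequence log-probabilities: the policy raised the chosen
# response's probability relative to the reference and lowered
# the rejected one's.
r_chosen = implicit_reward(logp_policy=-12.0, logp_ref=-14.0)    # +0.2
r_rejected = implicit_reward(logp_policy=-15.0, logp_ref=-13.0)  # -0.2
```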
The DPO Loss Function
A simple classification loss that replaces RLHF
The DPO Objective
Given a prompt x, chosen response y_w, and rejected response y_l:

L_DPO = -log sigmoid(beta * (log pi(y_w|x)/pi_ref(y_w|x) - log pi(y_l|x)/pi_ref(y_l|x)))

In words: compute the implicit reward margin, the difference between the chosen and rejected log-probability ratios (each measured against the reference model), and push that margin to be positive (chosen should be more likely than rejected).
Breaking It Down
log pi(y_w|x) / pi_ref(y_w|x): How much more likely is the chosen response under the current policy vs the reference? This is the "implicit reward" for the chosen response.

log pi(y_l|x) / pi_ref(y_l|x): Same for the rejected response.

The difference: If the policy assigns relatively higher probability to chosen (vs reference) than to rejected (vs reference), the loss is low.

beta: Controls the strength of the KL constraint. Higher beta = more conservative (stays closer to reference). Typical values: 0.1-0.5.
Intuition
DPO simultaneously does two things:

1. Increases the probability of chosen responses (relative to the reference model)

2. Decreases the probability of rejected responses (relative to the reference model)

The "relative to reference" part is crucial. It prevents the model from just increasing all probabilities (which would be meaningless) and acts as the implicit KL constraint.
# DPO loss in pseudocode
chosen_logps = policy.log_prob(chosen)
rejected_logps = policy.log_prob(rejected)
ref_chosen_logps = ref_model.log_prob(chosen)        # reference model is frozen
ref_rejected_logps = ref_model.log_prob(rejected)

# Implicit rewards: log-prob ratios vs. the reference model
chosen_reward = beta * (chosen_logps - ref_chosen_logps)
rejected_reward = beta * (rejected_logps - ref_rejected_logps)

# Binary cross-entropy on the reward margin
loss = -log_sigmoid(chosen_reward - rejected_reward)
This is just binary cross-entropy. DPO is essentially a classification problem: given two responses, classify which one is better. The "features" are the log-probability ratios. This makes DPO as simple to implement and train as SFT.
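To see the binary cross-entropy connection numerically, evaluate -log sigmoid at a few reward margins (a pure-Python sketch; the margin values are made up for illustration):

```python
import math

def dpo_loss_from_margin(margin):
    # -log sigmoid(margin): binary cross-entropy with the
    # "chosen beats rejected" label fixed to 1.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(dpo_loss_from_margin(0.0), 4))   # 0.6931 = log 2: model is indifferent
print(round(dpo_loss_from_margin(2.0), 4))   # 0.1269: chosen clearly preferred
print(round(dpo_loss_from_margin(-2.0), 4))  # 2.1269: rejected preferred -> large loss
```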
DPO vs PPO: Practical Comparison
Cost, complexity, and quality trade-offs
At a glance:

  Aspect               DPO                          PPO
  Models in memory     2 (policy + reference)       4 (policy + ref + RM + value)
  Generation           none during training         every training step
  Objective            simple classification loss   complex RL objective
  GPUs for a 7B model  1-2 A100s                    4-8 A100s
  Hyperparameters      few (beta, lr)               many
  Stability            stable training              unstable; reward hacking
  Training time        hours                        days
When DPO Wins
Most open-source alignment: Zephyr, Tulu 2, Neural Chat, OpenHermes, and most community models use DPO.

Limited compute: DPO needs only a quarter to a half of the GPU memory that PPO requires.

Rapid iteration: DPO trains in hours, PPO takes days. Better for experimentation.

Stability: DPO rarely diverges. PPO requires careful monitoring.
When PPO Might Win
Frontier models: OpenAI, Anthropic, and Google still use PPO (or variants) for their flagship models. At massive scale with expert tuning, PPO can squeeze out extra quality.

Online learning: PPO generates new responses during training, so it can explore and discover better responses. DPO only learns from pre-collected data (offline).

Complex reward signals: PPO can optimize for any reward function (safety classifiers, factuality checkers, etc.). DPO is limited to pairwise preferences.
The consensus (2024-2025): DPO is the default choice for alignment. Use PPO only if you have the compute, expertise, and a specific reason (e.g., online learning, complex reward signals). For most practitioners, DPO gives 90-95% of PPO quality at 20-30% of the cost.
ORPO: Odds Ratio Preference Optimization
Combining SFT and alignment in a single step
The ORPO Idea
Hong et al. (2024) observed that DPO still requires a separate SFT step first (to create the reference model). ORPO eliminates this by combining SFT and preference optimization into a single training step.

How: ORPO adds a preference-aware penalty to the standard SFT loss. The SFT loss teaches the model to generate good responses, while the odds ratio penalty teaches it to prefer chosen over rejected.
The ORPO Loss
L_ORPO = L_SFT + lambda * L_OR

Where:
- L_SFT = standard cross-entropy on chosen responses
- L_OR = -log sigmoid(log(odds(chosen) / odds(rejected)))
- odds(y) = P(y) / (1 - P(y)), where P(y) is the length-normalized sequence probability
- lambda = weighting factor (typically 0.1-1.0)

The odds ratio naturally captures how much more likely one response is than another, without needing a reference model.
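The loss above can be sketched in a few lines; following the ORPO formulation, P(y) here is the length-normalized sequence probability (exp of the average per-token log-probability), and the function names and lambda value are illustrative:

```python
import math

def log_odds(avg_logp):
    # log odds of a response: log(p / (1 - p)), with p = exp(avg_logp),
    # where avg_logp is the average per-token log-probability.
    p = math.exp(avg_logp)
    return avg_logp - math.log(1.0 - p)

def orpo_loss(sft_nll, chosen_avg_logp, rejected_avg_logp, lam=0.5):
    # L_OR = -log sigmoid(log odds(chosen) - log odds(rejected))
    ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    return sft_nll + lam * l_or  # L_ORPO = L_SFT + lambda * L_OR
```

No reference model appears anywhere: the odds ratio compares the policy's own probabilities for the two responses.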
ORPO Advantages
No reference model: Unlike DPO, ORPO doesn't need a frozen reference model in memory. This halves the memory requirement.

No SFT stage: ORPO combines SFT and alignment, saving one full training run.

Simpler pipeline: Base model + preference data = aligned model. One step.
DPO Pipeline
1. SFT on demonstrations
2. DPO on preferences
Models: policy + reference
Two training runs
ORPO Pipeline
1. ORPO on preferences
(SFT is built in)
Models: policy only
One training run
Trade-off: ORPO is simpler and cheaper, but DPO with a well-tuned SFT stage can sometimes produce better results. ORPO works best when you have good preference data that also serves as SFT data (the chosen responses are high quality demonstrations).
SimPO & KTO
Further simplifications of preference optimization
SimPO: Simple Preference Optimization
Meng et al. (2024) simplified DPO further by removing the reference model entirely. Instead of comparing log-probabilities to a reference, SimPO uses the average log-probability of the response as the implicit reward:

reward(y) = (1/|y|) * sum(log pi(y_t | y_<t, x))

This is just the average per-token log-probability, which naturally penalizes verbose responses (length normalization is built in).

Key benefit: No reference model needed. Half the memory of DPO. And the length normalization addresses the verbosity problem that plagues DPO and PPO.
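As a sketch (the published SimPO loss also scales the margin by beta and subtracts a target margin gamma; the values below are illustrative):

```python
import math

def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=0.5):
    # Length-normalized average log-prob as the implicit reward:
    # no reference model needed, and verbosity is not rewarded.
    r_w = sum(chosen_token_logps) / len(chosen_token_logps)
    r_l = sum(rejected_token_logps) / len(rejected_token_logps)
    margin = beta * (r_w - r_l) - gamma  # gamma: target reward margin
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```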
KTO: Kahneman-Tversky Optimization
Ethayarajh et al. (2024) addressed a different problem: DPO requires paired preferences (chosen AND rejected for the same prompt). KTO works with unpaired binary feedback:

- "This response is good" (thumbs up)
- "This response is bad" (thumbs down)

No need to pair them. This is much easier to collect in production (users give thumbs up/down on individual responses).

Based on: Kahneman-Tversky prospect theory from behavioral economics. Losses (bad responses) are weighted more heavily than gains (good responses), matching how humans perceive quality.
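A deliberately simplified sketch of the KTO idea for a single unpaired example (the actual loss estimates a KL-based reference point z_ref over the batch; here it is fixed at 0, and the lambda weights are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(logp, ref_logp, is_good, z_ref=0.0,
             beta=0.1, lam_good=1.0, lam_bad=1.33):
    # Implicit reward, as in DPO.
    r = beta * (logp - ref_logp)
    if is_good:
        # Reward good responses for exceeding the reference point...
        return lam_good * (1.0 - sigmoid(r - z_ref))
    # ...and penalize bad ones more heavily (loss aversion).
    return lam_bad * (1.0 - sigmoid(z_ref - r))
```

Each example contributes on its own, so thumbs-up and thumbs-down feedback never needs to be paired by prompt.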
SimPO and KTO solve real practical problems. SimPO removes the reference model (memory savings) and fixes verbosity. KTO removes the need for paired data (easier data collection). Both achieve competitive quality with DPO. Choose based on your constraints: paired data available? Use DPO. Only binary feedback? Use KTO. Memory constrained? Use SimPO or ORPO.
IPO, cDPO & Other Variants
Addressing DPO's theoretical limitations
IPO: Identity Preference Optimization
Azar et al. (2024, Google DeepMind) identified a theoretical issue with DPO: it can overfit to the preference data because the sigmoid loss saturates. When the model becomes very confident, the gradients vanish and it stops learning.

IPO replaces the sigmoid loss with a squared loss that doesn't saturate:

L_IPO = (log(pi(y_w)/pi_ref(y_w)) - log(pi(y_l)/pi_ref(y_l)) - 1/(2*beta))^2

This keeps gradients flowing even when the model is confident, leading to better generalization.
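The IPO loss translates directly into code; `chosen_logratio` below stands for log(pi(y_w)/pi_ref(y_w)), and the numbers are illustrative:

```python
def ipo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    # Squared loss with target margin 1/(2*beta): gradients never
    # vanish, and overshooting the target is penalized as well.
    margin = chosen_logratio - rejected_logratio
    return (margin - 1.0 / (2.0 * beta)) ** 2

# With beta = 0.1 the loss is zero at a margin of exactly 5.0
# and grows again if the model becomes overconfident.
```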
cDPO: Conservative DPO
Mitchell et al. (2023) addressed noisy preference labels. In real data, annotators disagree 25-35% of the time. Standard DPO treats all labels as correct, which can hurt when labels are wrong.

cDPO adds a label smoothing parameter that accounts for noise:

L_cDPO assumes P(chosen is actually better) = 1 - epsilon, mixing the loss over both possible orderings: (1 - epsilon) * L_DPO(y_w, y_l) + epsilon * L_DPO(y_l, y_w)

With epsilon = 0.1, the model doesn't fully trust any single preference label. This makes training more robust to annotation noise.
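Label smoothing turns the DPO loss into a mixture over both possible orderings; a minimal sketch, where margin is the beta-scaled reward margin from the DPO loss:

```python
import math

def cdpo_loss(margin, epsilon=0.1):
    # With probability epsilon the preference label is assumed flipped,
    # so the flipped ordering also contributes to the loss.
    def logsigmoid(x):
        return -math.log(1.0 + math.exp(-x))
    return -(1.0 - epsilon) * logsigmoid(margin) - epsilon * logsigmoid(-margin)
```

Unlike plain DPO, the minimum sits at a finite margin (where sigmoid(margin) = 1 - epsilon), so the model is never pushed to be infinitely confident in any single label.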
Other Notable Variants
RSO (Statistical Rejection Sampling): Generates responses from the optimal policy (not the SFT model) for better DPO training data.

SPIN (Self-Play): The model plays against itself to generate increasingly better preference data iteratively.

Iterative DPO: Run DPO, generate new responses with the improved model, collect new preferences, run DPO again. Bridges the offline/online gap.
In practice: Standard DPO works well for most use cases. Use cDPO if your preference data is noisy. Use IPO if you see DPO overfitting. Use iterative DPO if you want to bridge the gap with online PPO. The variants are refinements, not replacements.
Choosing an Alignment Method
Decision framework for practitioners
Decision Tree
Q: Do you have paired preference data?
→ No, only thumbs up/down: Use KTO
→ Yes: Continue below

Q: Do you have a separate SFT dataset?
→ No (preference data is your only data): Use ORPO
→ Yes: Continue below

Q: Is memory a constraint (single GPU)?
→ Yes: Use SimPO or ORPO (no reference model)
→ No: Continue below

Q: Is your preference data noisy?
→ Yes: Use cDPO or IPO
→ No: Use DPO (the default)
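The decision tree above can be encoded as a small helper (a simplification, of course; the function and parameter names are hypothetical, and real choices depend on more factors):

```python
def choose_alignment_method(paired_prefs, have_sft_data,
                            memory_constrained, noisy_labels):
    if not paired_prefs:
        return "KTO"            # only unpaired thumbs up/down
    if not have_sft_data:
        return "ORPO"           # SFT is built into the alignment step
    if memory_constrained:
        return "SimPO or ORPO"  # no frozen reference model in memory
    if noisy_labels:
        return "cDPO or IPO"    # robust to annotation noise
    return "DPO"                # the well-tested default
```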
Quick Reference
DPO: Default choice. Needs SFT first + reference model. Well-tested, widely used.

ORPO: One-step alignment. No SFT stage, no reference model. Simplest pipeline.

SimPO: Like DPO but no reference model. Built-in length normalization.

KTO: Works with unpaired binary feedback. Easiest data collection.

IPO: Better generalization than DPO. Use if DPO overfits.

cDPO: Robust to noisy labels. Use with imperfect annotations.

PPO: Maximum control. Use only at frontier scale with expert teams.
The recommended starting point: SFT on your demonstration data, then DPO on your preference data. This is the most well-understood pipeline with the most community support. Only switch to alternatives if you have a specific constraint (no paired data, no SFT data, memory limits, noisy labels).