The DPO Objective
Given a prompt x, chosen response y_w, and rejected response y_l:
L_DPO = -log sigmoid(beta * (log(pi(y_w|x) / pi_ref(y_w|x)) - log(pi(y_l|x) / pi_ref(y_l|x))))
In words: compute the log-probability ratio of the chosen response (policy vs. reference), subtract the same ratio for the rejected response, and push that margin to be large and positive (chosen should be favored more than rejected).
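To make the formula concrete, here is a minimal numeric check. All log-probability values are made up purely for illustration:

```python
import math

beta = 0.1
logp_w, ref_logp_w = -12.0, -14.0  # chosen: policy log-prob is 2 nats above the reference
logp_l, ref_logp_l = -13.0, -11.0  # rejected: policy log-prob is 2 nats below the reference

# Difference of the two log-ratios, scaled by beta: 0.1 * (2 - (-2)) = 0.4
margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
loss = -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)
print(round(loss, 3))  # → 0.513
```

A larger margin drives the sigmoid toward 1 and the loss toward 0; a negative margin (rejected favored) makes the loss grow roughly linearly.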
Breaking It Down
log(pi(y_w|x) / pi_ref(y_w|x)): How much more likely is the chosen response under the current policy than under the reference? This is the "implicit reward" for the chosen response.
log(pi(y_l|x) / pi_ref(y_l|x)): The same quantity for the rejected response.
The difference: If the policy assigns relatively higher probability to chosen (vs reference) than to rejected (vs reference), the loss is low.
beta: Controls the strength of the KL constraint. Higher beta = more conservative (stays closer to reference). Typical values: 0.1-0.5.
Intuition
DPO simultaneously does two things:
1. Increases the probability of chosen responses (relative to the reference model)
2. Decreases the probability of rejected responses (relative to the reference model)
The "relative to reference" part is crucial. It prevents the model from just increasing all probabilities (which would be meaningless) and acts as the implicit KL constraint.
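This anchoring can be checked directly: the loss depends only on the gap between the two log-ratios, so uniformly inflating the policy's log-probabilities on both responses changes nothing. The values and the helper name `dpo_loss` below are made up for illustration:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # -log sigmoid(margin)

base = dpo_loss(-12.0, -13.0, -14.0, -11.0)
# Raise BOTH policy log-probs by the same amount ("make everything more likely")
shifted = dpo_loss(-12.0 + 5.0, -13.0 + 5.0, -14.0, -11.0)
print(math.isclose(base, shifted))  # → True
```

The only way to lower the loss is to widen the gap: up-weight chosen and/or down-weight rejected relative to the reference.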
# DPO loss in PyTorch
# All log-probs are per-sequence values, summed over the response tokens:
#   chosen_logps / rejected_logps come from the current policy,
#   ref_chosen_logps / ref_rejected_logps from the frozen reference model.
import torch.nn.functional as F

def dpo_loss(chosen_logps, rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_reward = beta * (chosen_logps - ref_chosen_logps)        # implicit reward, chosen
    rejected_reward = beta * (rejected_logps - ref_rejected_logps)  # implicit reward, rejected
    return -F.logsigmoid(chosen_reward - rejected_reward)
This is just binary cross-entropy with the chosen response as the positive class. DPO is essentially a classification problem: given two responses, classify which one is better, with the log-probability ratios as the "features". This makes DPO as simple to implement and train as SFT.
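The cross-entropy view can be verified numerically: -log sigmoid(margin) equals the binary cross-entropy of a classifier with logit `margin` and the label "chosen wins" fixed to 1. The helper names and sample margins below are made up for illustration:

```python
import math

def neg_log_sigmoid(z):
    return math.log1p(math.exp(-z))  # -log sigmoid(z), stable for moderate z

def bce_with_logits(z, target):
    # standard binary cross-entropy on a single logit z
    p = 1 / (1 + math.exp(-z))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

for margin in (0.4, -1.2, 2.0):
    print(math.isclose(neg_log_sigmoid(margin), bce_with_logits(margin, 1.0)))
# → True for each margin
```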