cs.AI

When does DPO actually work like RLHF?

Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

May 20, 2026

The claimed equivalence between DPO and RLHF rests on an implicit assumption that often fails in real training: that the optimal policy under RLHF actually prefers the human-chosen response. When the assumption fails, DPO optimizes how responses rank relative to its reference policy rather than alignment with human preferences, so the loss can fall while the model learns to prefer worse outputs. The authors introduce Constrained Preference Optimization (CPO) to fix this by adding constraints that guarantee true alignment, and they report that it outperforms both DPO and RLHF on standard benchmarks. Code is released.
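One way to see how the loss can fall while the model still favors the worse response is to evaluate the standard DPO objective on a single preference pair. The sketch below is illustrative only: it uses the usual DPO loss, not the authors' CPO method. The probabilities, the `beta` value, and the function name `dpo_loss` are assumptions chosen for the example, not taken from the paper.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: chosen response w, rejected response l."""
    # The margin compares log-ratios against the reference policy,
    # not the policy's absolute preference between w and l.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log(1.0 + math.exp(-margin))  # equals -log(sigmoid(margin))

# Hypothetical reference policy that strongly prefers the rejected response.
ref_w, ref_l = math.log(0.1), math.log(0.6)

# Hypothetical policy before an update (identical to the reference) and after.
before = (math.log(0.1), math.log(0.6))
after = (math.log(0.2), math.log(0.5))

loss_before = dpo_loss(*before, ref_w, ref_l)
loss_after = dpo_loss(*after, ref_w, ref_l)

# The loss goes down even though the updated policy still ranks
# the rejected response above the chosen one.
loss_decreased = loss_after < loss_before
still_prefers_rejected = after[0] < after[1]
print(loss_decreased, still_prefers_rejected)
```

In this toy setting the margin becomes positive because the chosen response gained probability relative to the reference, yet the policy still assigns more probability to the rejected response. The loss only measures movement relative to the reference policy.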
Published as "Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment" (arXiv:2605.20834).