When does DPO actually work like RLHF?
Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo
May 20, 2026
DPO's claimed equivalence to RLHF rests on a hidden assumption that often breaks in real training: that the optimal policy under RLHF actually prefers the human-chosen responses. When this assumption fails, DPO optimizes rankings relative to its reference policy rather than alignment with human preferences, which creates a failure mode in which the loss decreases while the model learns to prefer worse outputs. The authors introduce Constrained Preference Optimization (CPO) to fix this by adding constraints that guarantee true alignment, and report that it outperforms both DPO and RLHF on standard benchmarks. Code is released.
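The failure mode described above can be illustrated with the standard DPO loss on a single preference pair. The sketch below is not the authors' code and does not implement CPO. The log-probabilities are made-up numbers chosen for illustration, and `dpo_loss` is a hypothetical helper name. It assumes a reference policy that strongly prefers the rejected response.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit reward margin: how much more the policy has shifted toward
    # the chosen response than toward the rejected one, relative to the reference.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Illustrative (assumed) log-probabilities: the reference prefers the rejected response.
ref_chosen, ref_rejected = -12.0, -8.0

# At initialization the policy equals the reference, so the margin is zero.
loss_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# After some training the policy has moved toward the chosen response,
# but still assigns it a lower log-probability than the rejected one.
policy_chosen, policy_rejected = -11.0, -9.0
loss_after = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)

print(f"loss at init:   {loss_init:.4f}")
print(f"loss after:     {loss_after:.4f}")
print(f"policy still prefers rejected: {policy_rejected > policy_chosen}")
```

In this toy setting the loss drops below its initial value even though the policy continues to rank the rejected response above the chosen one, because the loss only measures improvement relative to the reference policy.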