When does DPO actually work like RLHF?

DPO's claimed equivalence to RLHF relies on a hidden assumption that often breaks in real training: that the optimal policy under RLHF actually prefers the human-chosen responses. When this assumption fails, DPO optimizes relative rankings against its reference policy rather than alignment with human preferences. This creates a failure mode in which the loss decreases while the model learns to prefer worse outputs. The authors introduce Constrained Preference Optimization (CPO) to fix this by adding constraints that guarantee true alignment, and they show that it outperforms both DPO and RLHF on standard benchmarks. Code is released.
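
The failure mode can be illustrated with the standard DPO loss on a single preference pair. The sketch below is not the authors' code and does not implement CPO; the function name and the log-probability values are made up for illustration. It assumes a reference policy that strongly prefers the rejected response, and shows that the loss can fall below its starting value of log 2 while the trained policy still ranks the rejected response higher.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    # Margin is measured relative to the reference policy, not in absolute terms.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Hypothetical log-probabilities: the reference strongly prefers the rejected response.
ref_chosen, ref_rejected = -10.0, -5.0
# After some training the policy has moved toward the chosen response, but not far enough.
policy_chosen, policy_rejected = -8.0, -6.0

loss_at_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)
loss_after = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
policy_prefers_chosen = policy_chosen > policy_rejected

print(f"loss at init:  {loss_at_init:.4f}")
print(f"loss after:    {loss_after:.4f}")
print(f"policy prefers chosen response: {policy_prefers_chosen}")
```

Here the loss drops because the policy improved relative to the reference, even though in absolute terms it still assigns more probability to the rejected response. A monitoring check that compares policy_chosen and policy_rejected directly, rather than only tracking the loss, would surface this case.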