Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
The paper argues that the equivalence between Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) is conditional: it holds only under an implicit assumption that is often violated in practice. The authors propose Constrained Preference Optimization (CPO) to ensure provable alignment while keeping the method simple.
- Direct Preference Optimization (DPO) is presented as a simpler alternative to Reinforcement Learning from Human Feedback (RLHF).
- The equivalence between DPO and RLHF is conditional, depending on the assumption that the RLHF-optimal policy must prefer human-preferred responses.
- When this assumption fails, DPO may lead to undesirable outcomes, optimizing relative advantage instead of aligning with human preferences.
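The "relative advantage" point in the last bullet can be illustrated with the standard per-pair DPO loss, -log σ(β · margin), where the margin is the difference between the chosen and rejected log-probability ratios against a reference model. The sketch below is a minimal illustration of that standard formulation, not the paper's own analysis or its CPO method; the function name `dpo_loss`, the value of `beta`, and all log-probability values are hypothetical and chosen only for the example.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # The margin compares how much the policy moved each response
    # relative to the reference model; only this difference enters the loss.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical reference-model log-probabilities for one preference pair.
ref_chosen, ref_rejected = -10.0, -12.0

# Case A: the policy raises the log-probability of the chosen response.
loss_a = dpo_loss(-8.0, -12.0, ref_chosen, ref_rejected)

# Case B: the policy lowers BOTH responses, the rejected one by more.
# The chosen response is now less likely than under the reference model.
loss_b = dpo_loss(-11.0, -15.0, ref_chosen, ref_rejected)

# Both cases have the same margin, so the loss cannot tell them apart.
print(loss_a, loss_b)
```

Under these assumed numbers both cases produce the same loss, even though in Case B the human-preferred response has become less likely than under the reference model. This is the sense in which the objective rewards relative advantage only.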
Opening excerpt (first ~120 words)
arXiv:2605.20834 (cs), submitted on 20 May 2026
Authors: Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo
Abstract: Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.