Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
The paper proposes DMPO, an on-policy reinforcement learning method designed to counter mode collapse. DMPO preserves solution diversity by aligning the policy distribution with a target distribution derived from rewards. The reported results show significant improvements on a range of reasoning tasks, indicating that the method maintains exploration during training.
- On-policy reinforcement learning methods like GRPO often suffer from mode collapse, leading to reduced solution diversity.
- The proposed DMPO method prevents mode collapse by approximating forward KL minimization and aligning policy distributions with a target distribution based on rewards.
- DMPO achieved notable improvements in quality ratios on NP-hard combinatorial optimization tasks, outperforming GRPO in both text-based and vision-based benchmarks.
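
The forward KL idea in the second point can be made concrete with a toy calculation. The sketch below is not the DMPO algorithm from the paper; it only illustrates, under stated assumptions, why forward KL divergence to a reward-based target penalizes a collapsed policy more than a diverse one. The softmax-over-rewards target, the temperature, the reward values, and all function names are hypothetical choices made for illustration.

```python
import math


def softmax(values, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(target, policy):
    """KL(target || policy) over a finite candidate set.

    Terms where the target has zero mass contribute nothing.
    """
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0)


# Assumption: four candidate solutions, three equally good and one poor.
rewards = [1.0, 1.0, 1.0, 0.0]

# Assumption: the target distribution is a softmax over rewards.
target = softmax(rewards, temperature=0.5)

# A collapsed policy puts almost all mass on one good solution.
collapsed_policy = [0.97, 0.01, 0.01, 0.01]
# A diverse policy spreads mass across all good solutions.
diverse_policy = [0.32, 0.32, 0.32, 0.04]

kl_collapsed = forward_kl(target, collapsed_policy)
kl_diverse = forward_kl(target, diverse_policy)

print(f"forward KL, collapsed policy: {kl_collapsed:.4f}")
print(f"forward KL, diverse policy:   {kl_diverse:.4f}")
```

In this toy setting the collapsed policy incurs the larger divergence, because forward KL charges a high cost wherever the target assigns mass that the policy neglects. An objective that reduces this quantity therefore pushes probability back toward the neglected good solutions.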
Opening excerpt (first ~120 words)
arXiv:2605.19461 (cs), submitted on 19 May 2026
Title: Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
Authors: Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen
Abstract: On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.