Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

May 20, 2026 · 4:00 AM UTC · 3 min read

#artificial intelligence #reinforcement learning #machine learning

⚡ TL;DR · AI summary

The paper proposes DMPO, an approach to on-policy reinforcement learning that addresses mode collapse, where methods like GRPO concentrate probability mass on a single solution and stop exploring alternatives. DMPO preserves solution diversity by aligning the policy distribution with a reward-based target distribution, approximating forward KL minimization. It reportedly improves on GRPO on reasoning tasks, including NP-hard combinatorial optimization benchmarks in both text-based and vision-based settings, which suggests it helps maintain exploration during training.

Key facts

▪ On-policy reinforcement learning methods like GRPO often suffer from mode collapse, leading to reduced solution diversity.
▪ The proposed DMPO method prevents mode collapse by approximating forward KL minimization and aligning policy distributions with a target distribution based on rewards.
▪ DMPO achieved notable improvements in quality ratios on NP-hard combinatorial optimization tasks, outperforming GRPO in both text-based and vision-based benchmarks.
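
To make the distribution-matching idea in the key facts concrete, here is a minimal sketch of a forward KL loss over a group of sampled solutions. It is not the paper's DMPO objective, which is not reproduced in this summary. The function names, the `temperature` parameter, the softmax-over-rewards target, and the renormalization of policy log-probabilities within the group are all illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(target, policy):
    """KL(target || policy) = sum_i target_i * log(target_i / policy_i)."""
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0)


def group_forward_kl_loss(rewards, seq_logprobs, temperature=1.0):
    """Hypothetical loss: match the policy to a reward-based target over a group.

    Assumption: the target is softmax(reward / temperature) over the sampled
    group, and the policy distribution is the softmax of the sequence
    log-probabilities restricted to the same group.
    """
    target = softmax([r / temperature for r in rewards])
    policy = softmax(seq_logprobs)
    return forward_kl(target, policy)


# Four sampled solutions; the first two are equally good.
rewards = [1.0, 1.0, 0.0, 0.0]

# A collapsed policy puts almost all mass on one of the good solutions.
collapsed_logprobs = [0.0, -10.0, -10.0, -10.0]
# A spread policy whose within-group distribution equals the target.
spread_logprobs = [0.0, 0.0, -1.0, -1.0]

loss_collapsed = group_forward_kl_loss(rewards, collapsed_logprobs)
loss_spread = group_forward_kl_loss(rewards, spread_logprobs)
print(f"collapsed: {loss_collapsed:.3f}, spread: {loss_spread:.3f}")
```

Under these assumptions, the collapsed policy incurs a large loss because forward KL penalizes assigning little probability to any solution the target favors, while the spread policy's loss is close to zero. This mass-covering property is why a forward KL objective is commonly expected to discourage collapsing onto a single solution.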

Original article

arXiv cs.AI

Opening excerpt (first ~120 words)

arXiv:2605.19461 (cs) [Submitted on 19 May 2026]

Title: Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Authors: Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen

Abstract: On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
