
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Tags: artificial intelligence, reinforcement learning, machine learning
⚡ TL;DR · AI summary

The paper proposes DMPO, an on-policy reinforcement learning method that targets mode collapse: the tendency of methods like GRPO to concentrate probability mass on a single solution once it is discovered and to stop exploring alternatives. DMPO increases solution diversity by aligning the policy distribution with a target distribution derived from rewards. The summary reports significant improvements on several reasoning tasks and credits the method with maintaining exploration during training.
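To make the idea of matching a reward-based target concrete, here is a minimal sketch. It is not the paper's DMPO objective, which the excerpt does not spell out. The softmax-over-rewards target, the `temperature` parameter, the forward KL direction, and the helper names `reward_target` and `kl_divergence` are all assumptions chosen for illustration. It compares a collapsed policy and a diverse policy against the same target over four sampled solutions.

```python
import math

def reward_target(rewards, temperature=1.0):
    """Hypothetical target: softmax of rewards over a group of sampled solutions."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative data: three correct solutions (reward 1) and one incorrect (reward 0).
rewards = [1.0, 1.0, 1.0, 0.0]
target = reward_target(rewards, temperature=0.5)

collapsed = [0.97, 0.01, 0.01, 0.01]  # nearly all mass on one correct solution
diverse = [0.32, 0.32, 0.32, 0.04]    # mass spread over all correct solutions

kl_collapsed = kl_divergence(target, collapsed)
kl_diverse = kl_divergence(target, diverse)
print(f"KL(target || collapsed) = {kl_collapsed:.3f}")
print(f"KL(target || diverse)   = {kl_diverse:.3f}")
```

Under these assumptions the diverse policy sits much closer to the target than the collapsed one, so an objective that penalizes this divergence would push against putting all probability mass on a single solution.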

Original article: arXiv cs.AI
Opening excerpt (first ~120 words)

Computer Science > Artificial Intelligence
arXiv:2605.19461 (cs)
[Submitted on 19 May 2026]

Title: Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

Authors: Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen

Abstract: On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
