AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
The paper presents Adaptive Group Policy Optimization (AGPO), a method for reinforcement learning on large language models. AGPO uses group-level statistics to control both update magnitude and exploration, replacing the fixed clipping range and fixed decoding temperature that make PPO/GRPO training brittle and tuning-heavy. The reported results indicate that models trained with AGPO outperform these baselines on math and STEM benchmarks.
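For context, the fixed-clipping baseline that AGPO refines can be sketched in a few lines. The example below shows GRPO-style group-normalized advantages and a PPO-style clipped surrogate with a constant clip range. The function names (`group_advantages`, `clipped_surrogate`) and the sample rewards are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and standard deviation.

    This is the critic-free, group-relative advantage used by GRPO-style methods:
    no value network is needed because the group itself is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample, with a FIXED clip range."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Illustrative group of four sampled responses with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
```

Here `clip_eps` stays constant for every group; per the summary, AGPO's contribution is to let group statistics drive that range (and the sampling temperature) instead.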
- AGPO is a critic-free refinement of GRPO that enhances training stability.
- It employs adaptive clipping and bidirectional adaptive temperature sampling for better performance.
- Models trained with AGPO achieved significant improvements on math and STEM benchmarks.
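To make the idea of statistics-driven clipping and temperature concrete, here is a minimal sketch of how such feedback could look. It is not the paper's formulation: the excerpt does not give AGPO's update rules, so the choice of reward standard deviation as the group statistic, the direction of each adjustment, and every name and constant below (`adaptive_clip_eps`, `adapt_temperature`, `target_std`, `step`, the bounds) are assumptions made purely for illustration.

```python
from statistics import pstdev


def adaptive_clip_eps(rewards, base_eps=0.2, target_std=0.25,
                      eps_min=0.1, eps_max=0.3):
    """Hypothetical rule: shrink the clip range when group rewards are widely
    spread and widen it when they are tightly clustered, within fixed bounds.
    (Assumed for illustration; not the rule from the AGPO paper.)
    """
    spread = pstdev(rewards)
    scale = target_std / (spread + 1e-8)
    return max(eps_min, min(eps_max, base_eps * scale))


def adapt_temperature(temperature, rewards, target_std=0.25, step=0.05,
                      t_min=0.6, t_max=1.4):
    """Hypothetical bidirectional controller: raise the sampling temperature
    when the group's rewards are nearly identical (too little exploration),
    lower it when they are widely spread, and clamp to [t_min, t_max].
    (Assumed for illustration; not the rule from the AGPO paper.)
    """
    spread = pstdev(rewards)
    if spread < target_std:
        temperature += step
    elif spread > target_std:
        temperature -= step
    return max(t_min, min(t_max, temperature))


# A uniform group pushes temperature up; a split group pushes it down.
hot = adapt_temperature(1.0, [1.0, 1.0, 1.0, 1.0])
cold = adapt_temperature(1.0, [1.0, 0.0, 1.0, 0.0])
```

The point of the sketch is only the control structure: a single group statistic feeds two knobs, and the temperature knob can move in either direction rather than only annealing downward.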
Opening excerpt (first ~120 words)
arXiv:2605.20722 (cs), Computer Science > Machine Learning
Submitted on 20 May 2026
Title: AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
Authors: Miaobo Hu, Shuhao Hu, Bokun Wang, Ruohan Wang, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao
Abstract: Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.