Beyond 80/20: High-Entropy Minority Tokens Drive Effective RL for LLM Reasoning
This study investigates Reinforcement Learning with Verifiable Rewards (RLVR) in large language models by analyzing token entropy patterns during reasoning. It finds that a minority of high-entropy tokens, which act as decision points (forks) in reasoning paths, are primarily responsible for the performance gains from RLVR. Restricting policy-gradient updates to these forking tokens achieves results comparable or superior to full-gradient updates while training on only about 20% of the tokens.
- The research introduces a token entropy perspective to understand how RLVR enhances reasoning in large language models.
- High-entropy tokens, though few in number, serve as critical decision points that guide reasoning pathways in chain-of-thought generation.
- RLVR training mainly adjusts the entropy of high-entropy tokens while preserving the overall entropy structure of the base model.
- Restricting policy gradients to high-entropy tokens maintains or improves performance across multiple Qwen3 model sizes despite using only 20% of tokens.
- Training exclusively on low-entropy tokens leads to significant performance degradation, underscoring the importance of high-entropy tokens in effective reasoning.
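
The idea of restricting policy gradients to high-entropy tokens can be illustrated with a small sketch. This is not the authors' implementation: the function names (`token_entropy`, `high_entropy_mask`, `masked_pg_loss`), the plain policy-gradient surrogate, and the choice to take the top 20% within a single sequence are all assumptions made for illustration. A real training setup would likely operate on batched tensors and may select tokens over a different scope.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(entropies, keep_fraction=0.2):
    """Return a 0/1 mask keeping roughly the top `keep_fraction` of positions.

    The kept count is rounded to the nearest integer and is at least 1.
    Ties are broken in favor of earlier positions (an arbitrary choice).
    """
    n = len(entropies)
    if n == 0:
        return []
    k = max(1, round(keep_fraction * n))
    ranked = sorted(range(n), key=lambda i: (-entropies[i], i))
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(n)]


def masked_pg_loss(logprobs, advantages, mask):
    """Mean of -(advantage * logprob) over the kept positions only.

    Positions with mask 0 contribute nothing, so in an autodiff
    framework they would receive no gradient from this loss.
    """
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = sum(-a * lp * m for lp, a, m in zip(logprobs, advantages, mask))
    return total / kept
```

For example, given per-token entropies for a ten-token response, `high_entropy_mask(entropies)` marks the two highest-entropy positions, and `masked_pg_loss` averages the surrogate loss over just those two. Inverting the mask would give the low-entropy-only variant that the last bullet describes as degrading performance.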
arXiv:2506.01939 (cs), Computer Science > Computation and Language. Submitted on 2 Jun 2025 (v1), last revised 13 Nov 2025 (v2).
Title: Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Authors: Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the…