SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
This paper addresses limitations of reinforcement learning with verifiable rewards (RLVR) for large language models. It argues that structural properties of standard RLVR objectives hinder exploration and proposes a framework, SAGE, to enhance reasoning capabilities. The authors report that SAGE improves performance on reasoning tasks by reshaping the reverse-KL anchor distribution.
- Reinforcement learning with verifiable rewards (RLVR) often fails to improve pass@k despite gains in pass@1.
- The authors argue that reverse-KL regularization anchors the policy to the reference distribution, limiting exploration.
- SAGE is proposed as a framework to expand empirical support and improve reasoning performance across benchmarks.
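
The anchoring effect mentioned above can be illustrated with a toy calculation. The sketch below assumes the textbook KL-regularized objective, maximize E_pi[r] - beta * KL(pi || p_ref) over a finite set of outcomes, whose maximizer is pi*(y) proportional to p_ref(y) * exp(r(y) / beta). This is not taken from the paper and does not implement SAGE; the function name, the three "reasoning modes", and all numbers are hypothetical.

```python
import math


def kl_regularized_optimum(p_ref, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || p_ref) on a finite set.

    Assumption: this is the standard reverse-KL-regularized objective,
    used here only for illustration.
    """
    # Tilt each reference probability by exp(reward / beta), then renormalize.
    weights = [p * math.exp(r / beta) for p, r in zip(p_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical toy setup: three reasoning modes for one prompt.
# Mode 0 is wrong, modes 1 and 2 are correct, but the reference
# policy puts no probability mass on mode 2.
p_ref = [0.7, 0.3, 0.0]
rewards = [0.0, 1.0, 1.0]

pi_star = kl_regularized_optimum(p_ref, rewards, beta=0.5)

# Mass moves toward the correct mode the reference already supports,
# while the unsupported correct mode stays at exactly zero.
for mode, (p, q) in enumerate(zip(p_ref, pi_star)):
    print(f"mode {mode}: reference={p:.3f}  optimum={q:.3f}")
```

Under these assumptions, the optimal policy reweights outcomes the reference can already produce (which helps pass@1) but never assigns mass to an outcome with zero reference probability, however high its reward. This is one way to read the claim that the anchor limits exploration.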
Opening excerpt (first ~120 words)
arXiv:2605.18864 (Computer Science > Machine Learning), submitted on 15 May 2026
Authors: Chanuk Lee, Minki Kang, Sung Ju Hwang
Abstract: Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.