SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

May 20, 2026 · 4:00 AM UTC · 3 min read

#machine learning #artificial intelligence #reinforcement learning

TL;DR · WeSearch summary

The paper 'SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs' addresses limitations of reinforcement learning with verifiable rewards (RLVR) for large language models. It argues that structural properties of the standard RLVR objective hinder exploration, and it proposes a framework, SAGE, to strengthen reasoning capabilities. The authors report that SAGE improves performance on reasoning tasks by reshaping the reverse-KL anchor distribution.
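For intuition on why the anchor matters, here is a minimal sketch on a toy discrete distribution. It assumes the textbook KL-regularized objective, maximize E_pi[r] - beta * KL(pi || pi_ref), whose maximizer is pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The summary does not state the paper's exact objective or how SAGE reshapes the anchor, so the function name, the three "modes", the rewards, and beta are all hypothetical.

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || ref) on a finite set.

    Assumption: this is the standard reverse-KL regularized objective,
    not necessarily the exact objective used in the paper.
    """
    # Each outcome is reweighted by exp(r / beta), then renormalized.
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy problem: three candidate reasoning modes.
# The reference model never samples the third one, although it is correct.
ref = [0.7, 0.3, 0.0]
rewards = [0.0, 1.0, 1.0]

pi = kl_regularized_optimum(ref, rewards, beta=0.5)
print(pi)  # mass shifts toward the second mode; the third stays at zero
```

Under this objective the optimum can only reweight outcomes the reference already supports: the third mode earns full reward but keeps zero probability, which is the sense in which the anchor limits exploration in this toy setting.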

Key facts

▪ Reinforcement learning with verifiable rewards (RLVR) often fails to improve pass@k despite gains in pass@1.
▪ The authors argue that reverse-KL regularization anchors the policy to the reference distribution, limiting exploration.
▪ SAGE is proposed as a framework to expand empirical support and improve reasoning performance across benchmarks.
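To make the pass@1 versus pass@k distinction in the key facts concrete, the sketch below uses a widely used unbiased estimator: with n samples per problem, of which c are correct, pass@k = 1 - C(n - c, k) / C(n, k). Whether the paper computes pass@k this way is not stated in the summary, so treat the function and the sample counts as illustrative assumptions.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct.

    Probability that at least one of k samples drawn without replacement
    from the n generated samples is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 samples for one problem, only 1 correct.
p1 = pass_at_k(10, 1, 1)  # about 0.1
p5 = pass_at_k(10, 1, 5)  # 0.5
```

A rare correct mode contributes little to pass@1 but a lot to pass@k, so a policy that sharpens onto already-likely modes can raise pass@1 while leaving pass@k flat.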

Original article

arXiv cs.AI


Opening excerpt (first ~120 words)

arXiv:2605.18864 (cs) · Computer Science > Machine Learning · Submitted on 15 May 2026

Title: SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs
Authors: Chanuk Lee, Minki Kang, Sung Ju Hwang

Abstract: Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
