Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
The paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). The framework adapts criterion-level reward weights during training so that rubric-based rewards stay informative as the policy changes. The authors report that POW3R improves performance across multiple policies and datasets relative to vanilla GRPO with rubric rewards.
- POW3R preserves human weights and category balance while adapting rewards during training.
- The framework emphasizes criteria that currently distinguish the policy's outputs.
- POW3R outperformed vanilla GRPO with rubric rewards in 24 out of 30 comparisons.
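
The summary does not give POW3R's actual update rule, so the following is only a minimal sketch of the general idea in the bullets above, not the authors' method. It assumes that "criteria that currently distinguish the policy's outputs" can be approximated by the variance of each criterion's score across a group of sampled responses, and that "category balance" means each category's total weight stays equal to its human-assigned total. The function name `policy_aware_weights` and the `floor` parameter are hypothetical.

```python
from statistics import pvariance


def policy_aware_weights(human_weights, categories, scores, floor=0.25):
    """Hypothetical reweighting sketch (not the paper's rule).

    human_weights: one positive human-assigned weight per criterion
    categories:    one category label per criterion
    scores:        scores[i][k] is the score of sampled response i on criterion k
    floor:         assumed constant that keeps every criterion's weight above zero
    """
    n_criteria = len(human_weights)
    # Spread across sampled responses: zero when all responses score the same,
    # i.e. the criterion does not currently distinguish the policy's outputs.
    spread = [pvariance([row[k] for row in scores]) for k in range(n_criteria)]
    # Human weight acts as a prior; spread scales it up for discriminative criteria.
    raw = [w * (floor + s) for w, s in zip(human_weights, spread)]

    adapted = [0.0] * n_criteria
    for cat in set(categories):
        idx = [k for k, c in enumerate(categories) if c == cat]
        human_total = sum(human_weights[k] for k in idx)
        raw_total = sum(raw[k] for k in idx)
        # Rescale inside the category so its total weight matches the human total.
        for k in idx:
            adapted[k] = human_total * raw[k] / raw_total
    return adapted


def rubric_reward(weights, criterion_scores):
    """Weighted average of one response's per-criterion scores."""
    return sum(w * s for w, s in zip(weights, criterion_scores)) / sum(weights)


# Toy group of four sampled responses scored on three criteria.
human_weights = [2.0, 1.0, 1.0]
categories = ["accuracy", "accuracy", "style"]
scores = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]
weights = policy_aware_weights(human_weights, categories, scores)
rewards = [rubric_reward(weights, row) for row in scores]
```

In this toy group every response already satisfies the first criterion, so some of its weight shifts to the second criterion in the same category, while the "accuracy" and "style" totals stay at their human-assigned values.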
Opening excerpt (first ~120 words)
arXiv:2605.20164 (Computer Science, Artificial Intelligence), submitted on 19 May 2026
Authors: Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He
Abstract: Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once.
…