
Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs

⚡ TL;DR · AI summary

Researchers Xingwei Gan and Ying Zhu propose a post-training method for large language models (LLMs) that averages the logits of a frozen reference policy, such as a supervised fine-tuned (SFT) model, with those of a trainable policy. They integrate the method into Group Relative Policy Optimization (GRPO). The approach reportedly matches or improves on the accuracy of KL-regularized baselines across several benchmarks.
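To make the core idea concrete, here is a minimal sketch of per-token logit averaging in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `weight` parameter, and the toy logit values are all hypothetical, and the summary does not say whether the paper uses an equal-weight mean.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_logits(ref_logits, policy_logits, weight=0.5):
    """Blend frozen-reference and trainable-policy logits for one token position.

    weight=0.5 gives a plain arithmetic mean. The weight parameter is an
    assumption made for illustration; the source does not specify it.
    """
    if len(ref_logits) != len(policy_logits):
        raise ValueError("logit vectors must share a vocabulary size")
    return [weight * r + (1.0 - weight) * p
            for r, p in zip(ref_logits, policy_logits)]

# Toy 3-token vocabulary with made-up numbers.
ref_logits = [2.0, 0.0, -1.0]     # frozen reference, e.g. the SFT model
policy_logits = [0.0, 2.0, -1.0]  # trainable policy
mixed = average_logits(ref_logits, policy_logits)
probs = softmax(mixed)            # distribution a sampler would draw from
```

With equal weights, the softmax of the averaged logits is proportional to the geometric mean of the two models' token distributions, so a token needs support from both models to keep a high probability.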

Original article: arXiv cs.AI
Opening excerpt (first ~120 words)

arXiv:2605.20555 (cs), Computer Science > Machine Learning
Submitted on 19 May 2026
Title: Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs
Authors: Xingwei Gan, Ying Zhu
Abstract: We introduce a novel method that averages the logits of a frozen reference policy (e.g., SFT) and a trainable policy, and incorporate the method into Group Relative Policy Optimization (GRPO).
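For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step that gives the algorithm its name: rewards for several sampled responses to the same prompt are normalized against the group's own mean and standard deviation. This is a general illustration of GRPO, not code from the paper. The function name, the `eps` constant, the use of population standard deviation, and the reward values are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; implementations differ on this
    detail, so treat it as an assumption. eps guards against division by zero
    when every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and the rest receive negative ones, which is why no separate value network is needed.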

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
