AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
The paper introduces Asymmetric Meta-Reflective Self-Distillation (AMR-SD), a method for token-level credit assignment in Large Language Models. Existing algorithms such as GRPO apply a single sequence-level reward uniformly to every token, which creates a credit-assignment bottleneck. AMR-SD uses a reflection bottleneck and causal information gain to improve training stability and prevent training collapse.
- AMR-SD aims to improve the alignment of Large Language Models for complex reasoning tasks.
- The method incorporates a reflection bottleneck to compress diagnostic signals into concise hints and critiques.
- Experiments show that AMR-SD significantly outperforms existing baselines across various benchmarks.
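To make the uniform-reward problem concrete, here is a minimal sketch of how a GRPO-style update turns one sequence-level reward into the same advantage for every token in a response. It is an illustration of the baseline behavior the paper criticizes, not AMR-SD itself. The function names, the example rewards, and the token counts are hypothetical, and the use of the population standard deviation is a simplifying assumption.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize each sequence reward within its sampling group (GRPO-style).

    Assumption: population standard deviation; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Give every token in a sequence that sequence's single advantage value."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Hypothetical group of 4 sampled responses to one prompt:
# reward 1.0 means the final answer was verified correct, 0.0 means incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [5, 3, 4, 2]

advantages = group_normalized_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)

for seq_id, token_advantages in enumerate(per_token):
    print(seq_id, [round(a, 3) for a in token_advantages])
```

Every token in a given response receives an identical value, so a token that caused the error and a token that was harmless are pushed in the same direction by the same amount. That is the credit-assignment bottleneck that token-level methods are meant to relieve.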
arXiv:2605.18529 (cs), submitted on 18 May 2026
Authors: Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang, Jiajun Chai, Zhexin Hu, Wei Lin, Shanbin Zhang, Guojun Yin
Abstract: The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.