Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
The paper proposes a reinforcement learning-based approach to jailbreak attacks on Large Reasoning Models (LRMs). It reports a correlation between the attack success rate and the attention patterns of LRMs. According to the authors, the proposed method executes these attacks more effectively and efficiently than existing strategies.
- Large Reasoning Models have shown vulnerability to jailbreak attacks.
- The success of these attacks is linked to how attention is allocated in the model's reasoning process.
- The authors propose a reinforcement learning-based method that incorporates attention signals to enhance attack effectiveness.
Opening excerpt (first ~120 words)
arXiv:2605.19485 (cs), submitted on 19 May 2026
Authors: Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.