Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
The paper introduces BITE, a framework designed to exploit stylistic biases in LLM judges. It demonstrates that these biases can be manipulated to artificially inflate scores without altering the underlying semantics. The findings highlight vulnerabilities in the LLM-as-a-judge paradigm and call for more robust evaluation methods.
- BITE achieves an attack success rate exceeding 65%.
- The framework raises scores by 1-2 points on a 9-point scale.
- BITE evades standard style-control methods and several detection baselines.
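
To make the idea of a bandit-guided style attack concrete, here is a minimal sketch, not the paper's BITE implementation. It assumes a toy judge with a built-in verbosity bias on a 1-9 scale, three hypothetical style rewrites as bandit arms, and a standard UCB1 selection rule. The names `toy_judge`, `add_framing`, `add_structure`, `expand_contractions`, and `ucb1_attack` are all invented for illustration, and the sketch does not check that the rewrites preserve meaning.

```python
import math

# Hypothetical style rewrites ("arms"). None of these come from the paper.
def add_framing(text):
    return text + " This answer is thorough and carefully reasoned."

def add_structure(text):
    return "Overview:\n" + text + "\nSummary: the points above cover the question."

def expand_contractions(text):
    return text.replace("doesn't", "does not")

ARMS = {
    "add_framing": add_framing,
    "add_structure": add_structure,
    "expand_contractions": expand_contractions,
}

def toy_judge(text):
    """Stand-in judge (an assumption): longer answers score higher, clipped to 1-9."""
    return max(1, min(9, 1 + len(text.split()) // 5))

def ucb1_attack(text, judge, arms, rounds=30, c=1.0):
    """Pick style rewrites with UCB1, keeping a rewrite only if the score rises."""
    names = list(arms)
    pulls = {n: 0 for n in names}
    totals = {n: 0.0 for n in names}
    base_score = judge(text)
    best_text, best_score = text, base_score
    for t in range(1, rounds + 1):
        untried = [n for n in names if pulls[n] == 0]
        if untried:
            name = untried[0]  # play every arm once before using the UCB rule
        else:
            name = max(
                names,
                key=lambda n: totals[n] / pulls[n]
                + c * math.sqrt(2 * math.log(t) / pulls[n]),
            )
        candidate = arms[name](best_text)
        score = judge(candidate)
        # Reward is the score gain, scaled into [0, 1] for the 9-point scale.
        reward = max(0, score - best_score) / 8
        pulls[name] += 1
        totals[name] += reward
        if score > best_score:
            best_text, best_score = candidate, score
    return base_score, best_score, best_text, pulls

answer = "Sorting works by comparing items. It doesn't need extra memory."
base, final, rewritten, pulls = ucb1_attack(answer, toy_judge, ARMS)
print(f"score before: {base}, after: {final}, arm pulls: {pulls}")
```

In this toy setup the bandit shifts its pulls toward whichever rewrite keeps raising the judge's score, which is the general mechanism the summary describes. A real attack would query an actual LLM judge and would need a separate check that the rewritten answer keeps its original meaning.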
Opening excerpt (first ~120 words)
arXiv:2605.26156 (cs), Computer Science > Cryptography and Security. Submitted on 24 May 2026.
Title: Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
Authors: Xianglin Yang, Bryan Hooi, Gelei Deng, Tianwei Zhang, Jin Song Dong
Abstract: The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.