Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
The study explores the impact of different persona vectors on sycophancy in AI models. It compares off-the-shelf persona steering vectors with the standard Contrastive Activation Addition (CAA) method. Results indicate that persona vectors can effectively reduce sycophancy while maintaining accuracy when users are correct.
- The research evaluates the effectiveness of off-the-shelf persona vectors in reducing sycophancy in AI models.
- Steering toward personas characterized by doubt or scrutiny achieves significant reductions in sycophancy compared to the standard CAA method.
- The findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction.
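For readers unfamiliar with activation steering, here is a minimal sketch of the idea behind CAA. The summary says only that CAA derives a steering direction from labelled pairs of sycophantic and honest responses. The mean-difference formula, the additive update, the function names `caa_direction` and `steer`, the coefficient `alpha`, and the toy numbers are all assumptions made for illustration, not the authors' implementation. A persona vector would presumably take the place of `direction` in the same update.

```python
from statistics import fmean
from typing import List, Sequence

Vector = List[float]


def caa_direction(sycophantic: Sequence[Vector], honest: Sequence[Vector]) -> Vector:
    """Average the per-pair activation difference (sycophantic minus honest).

    Assumption: each input row is a hidden-state vector captured at one layer
    for one response, and rows at the same index form a labelled pair.
    """
    if len(sycophantic) != len(honest) or not sycophantic:
        raise ValueError("need equally many, non-empty paired activations")
    dim = len(sycophantic[0])
    return [
        fmean(s[i] - h[i] for s, h in zip(sycophantic, honest))
        for i in range(dim)
    ]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to a hidden state.

    A negative alpha pushes the state away from the sycophantic side.
    """
    return [x + alpha * d for x, d in zip(hidden, direction)]


# Toy 3-dimensional activations, invented purely for illustration.
syc = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
hon = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]

direction = caa_direction(syc, hon)
steered = steer([0.5, 0.5, 0.5], direction, alpha=-1.0)
```

In a real model the vectors would be residual-stream activations with thousands of dimensions, and the update would be applied inside the forward pass at a chosen layer. Both details are outside what this summary states.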
Opening excerpt (first ~120 words)
Computer Science > Artificial Intelligence, arXiv:2605.21006 (cs). Submitted on 20 May 2026. Title: Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy. Authors: Ishaan Kelkar, Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary. Abstract: We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.