Refusal in Language Models Is Mediated by a Single Direction
Researchers have discovered that refusal behavior in large language models is mediated by a single direction (a one-dimensional subspace) within the model's internal representations. By manipulating this direction, they can either disable refusal on harmful requests or induce refusal on harmless ones. This finding reveals vulnerabilities in current safety fine-tuning methods and enables a new class of white-box jailbreak techniques.
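The direction itself is estimated from the model's internal activations. As a rough sketch of how such a direction can be computed, the snippet below takes the difference of mean activations collected on harmful versus harmless instructions (a difference-in-means estimate); the tensor shapes, the layer and token position the activations come from, and all variable names are illustrative assumptions rather than the authors' exact code.

```python
import torch

def difference_in_means_direction(harmful_acts: torch.Tensor,
                                  harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a candidate 'refusal direction' as a difference of means.

    Both inputs are assumed to be residual-stream activations collected at a
    single layer and token position, shaped [num_prompts, d_model].
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit-normalize for later projections

# Illustrative usage with random stand-ins for real cached activations.
d_model = 4096
harmful_acts = torch.randn(128, d_model)   # activations on harmful instructions
harmless_acts = torch.randn(128, d_model)  # activations on harmless instructions
refusal_dir = difference_in_means_direction(harmful_acts, harmless_acts)
```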
- Refusal behavior in 13 open-source chat models is mediated by a single one-dimensional subspace.
- Erasing this direction prevents refusal on harmful instructions, while adding it triggers refusal on benign ones (a minimal sketch of both operations appears after this list).
- The study introduces a white-box jailbreak method that disables refusal with minimal impact on other model capabilities.
- Adversarial suffixes are shown to suppress the propagation of the refusal-mediating direction.
- The results highlight the fragility of current safety fine-tuning approaches in language models.
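For the "erasing" and "adding" operations referenced above, the core linear algebra is a projection onto, or away from, the extracted direction. The sketch below shows directional ablation, activation addition, and a weight-level analogue in which a matrix that writes into the residual stream is orthogonalized against the direction; function names, shapes, and where these operations get applied inside a model are assumptions for illustration only, not the paper's exact implementation.

```python
import torch

def ablate_direction(activation: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `activation` along a unit-norm `direction`
    (directional ablation); `activation` is shaped [..., d_model]."""
    proj = (activation @ direction).unsqueeze(-1) * direction
    return activation - proj

def add_direction(activation: torch.Tensor, direction: torch.Tensor,
                  strength: float = 1.0) -> torch.Tensor:
    """Shift `activation` along `direction` (activation addition)."""
    return activation + strength * direction

def orthogonalize_weights(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Make a weight matrix that writes to the residual stream unable to output
    any component along `direction`: W' = W - d d^T W (shapes illustrative)."""
    # weight: [d_model, d_in], direction: [d_model], assumed unit-norm.
    return weight - direction.unsqueeze(-1) * (direction @ weight)

# Illustrative usage with random stand-ins.
d_model = 4096
refusal_dir = torch.randn(d_model)
refusal_dir = refusal_dir / refusal_dir.norm()
resid = torch.randn(2, 10, d_model)                        # [batch, seq, d_model]
bypassed = ablate_direction(resid, refusal_dir)            # erase: suppress refusal
induced = add_direction(resid, refusal_dir, strength=4.0)  # add: induce refusal
```

In the weight-edit view, orthogonalizing every matrix that writes into the residual stream plays roughly the same role as ablating the direction at inference time, which is what makes the intervention behave like a standalone jailbroken model rather than a runtime hook.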
Opening excerpt (first ~120 words)
Computer Science > Machine Learning
arXiv:2406.11717 (cs)
[Submitted on 17 Jun 2024 (v1), last revised 30 Oct 2024 (this version, v3)]
Title: Refusal in Language Models Is Mediated by a Single Direction
Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Hacker News: Front Page.