
Refusal in Language Models Is Mediated by a Single Direction

Tags: machine learning, artificial intelligence, language models, ai safety, model interpretability, arXiv
TL;DR (AI summary)

Researchers have found that refusal behavior in large language models is mediated by a single direction (a one-dimensional subspace) in the model's internal representations. By manipulating this direction, they can either disable refusal on harmful requests or induce refusal on harmless ones. This finding exposes a weakness in current safety fine-tuning methods and enables a new class of white-box jailbreak techniques.
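The core intervention this describes is directional ablation: removing the component of each activation vector that lies along the candidate "refusal direction", so the model can no longer represent that feature. The sketch below is illustrative only, using NumPy and made-up toy activations rather than the authors' implementation; names like `ablate_direction` and the example vectors are invented for this sketch:

```python
import numpy as np

def ablate_direction(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project each activation vector onto the orthogonal complement of
    `direction`, removing its component along that direction."""
    r = direction / np.linalg.norm(direction)  # unit-norm "refusal" direction
    # x' = x - (x . r) r, applied to every row of `activations`
    return activations - np.outer(activations @ r, r)

# Toy example: 3 activation vectors in a 4-dimensional residual stream.
acts = np.array([[1.0, 2.0, 0.0, 1.0],
                 [0.5, -1.0, 3.0, 0.0],
                 [2.0, 0.0, 1.0, 1.0]])
refusal_dir = np.array([0.0, 1.0, 0.0, 0.0])

ablated = ablate_direction(acts, refusal_dir)
```

After ablation, every row of `ablated` has zero dot product with `refusal_dir`, i.e. the "refusal" component has been erased while all orthogonal information is preserved.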

Source: Hacker News: Front Page

Opening excerpt (first ~120 words)

Computer Science > Machine Learning · arXiv:2406.11717 (cs)
[Submitted on 17 Jun 2024 (v1), last revised 30 Oct 2024 (this version, v3)]
Title: Refusal in Language Models Is Mediated by a Single Direction
Authors: Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood.
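A standard recipe in the interpretability literature for extracting such a direction is a difference of means: average the model's activations over harmful prompts, average them over harmless prompts, and take the normalized difference. The sketch below uses synthetic data; the function name and the toy activations are invented for illustration and are not the authors' code:

```python
import numpy as np

def difference_of_means(harmful_acts: np.ndarray,
                        harmless_acts: np.ndarray) -> np.ndarray:
    """Candidate refusal direction: mean activation on harmful prompts
    minus mean activation on harmless prompts, normalized to unit length."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

# Synthetic activations (rows = prompts, cols = residual-stream dimensions).
rng = np.random.default_rng(0)
harmless = rng.normal(size=(8, 4))
# Pretend harmful prompts shift activations along dimension 1.
harmful = harmless + np.array([0.0, 2.0, 0.0, 0.0])

direction = difference_of_means(harmful, harmless)
```

Because the synthetic "harmful" activations differ from the harmless ones by a fixed shift along dimension 1, the recovered unit vector points along that dimension; on a real model the two activation sets would come from running matched harmful and harmless prompts through the network.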

Excerpt limited to ~120 words for fair-use compliance. The full article is at Hacker News: Front Page.
