A Single Neuron Is Sufficient to Bypass Safety Alignment in LLMs
A new study reports that safety alignment in large language models can be bypassed by manipulating a single neuron. The researchers show that suppressing one refusal neuron is enough to make a model comply with harmful requests it would otherwise refuse. The result held across seven models ranging from 1.7B to 70B parameters, which suggests a systemic weakness in current alignment mechanisms rather than a quirk of one model.
- Safety alignment in language models relies on distinct refusal and concept neurons.
- Suppressing a single refusal neuron can bypass safety alignment across various harmful requests.
- The study tested seven models ranging from 1.7B to 70B parameters, showing consistent results without training or prompt engineering.
- Amplifying certain neurons can also generate harmful content from benign prompts.
- The research highlights that safety mechanisms are not robustly distributed but depend on individual, critical neurons.
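To make "suppressing" and "amplifying" a neuron concrete, here is a minimal toy sketch in plain Python. It is not the paper's method: the network, its weights, and the `scale` parameter are hypothetical, and the excerpt does not say how the authors locate refusal or concept neurons. It only shows the general idea of multiplying one hidden activation by a constant and observing how the output shifts.

```python
def mlp_forward(x, w_in, w_out, scale=None):
    """Toy two-layer MLP with ReLU hidden units.

    `scale` (hypothetical) maps a hidden-neuron index to a multiplier:
    {j: 0.0} suppresses neuron j, {j: 2.0} amplifies it.
    """
    scale = scale or {}
    hidden = []
    for j, row in enumerate(w_in):
        pre = sum(w * xi for w, xi in zip(row, x))
        act = max(0.0, pre)                      # ReLU activation
        hidden.append(act * scale.get(j, 1.0))   # intervene on selected neurons
    # Each output is a weighted sum of the (possibly modified) hidden activations.
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


# Hypothetical toy weights: 2 inputs, 3 hidden neurons, 2 outputs.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
x = [1.0, 2.0]

baseline = mlp_forward(x, W_IN, W_OUT)                     # [7.0, -1.0]
suppressed = mlp_forward(x, W_IN, W_OUT, scale={2: 0.0})   # [1.0, 2.0]
amplified = mlp_forward(x, W_IN, W_OUT, scale={2: 2.0})    # [13.0, -4.0]
```

In this toy network, zeroing hidden neuron 2 flips which output is larger, which is the sense in which a behaviour can hinge on one unit rather than being spread across many.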
Opening excerpt (first ~120 words)
arXiv:2605.08513 (cs), Computer Science > Computation and Language. Submitted on 8 May 2026.
Title: A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
Authors: Hamid Kazemi, Atoosa Chegini, Maria Safi
Abstract: Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.