Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models
The paper proposes Buffer-and-Reinforce, a framework for safe fine-tuning of large language models (LLMs). The framework uses temporary jailbreaking to mitigate harmful updates during user fine-tuning while preserving performance. The authors report experiments showing that it improves safety without additional safety data or significant computational cost.
- Fine-tuning-as-a-Service (FaaS) can weaken safety alignment under harmful fine-tuning attacks.
- The proposed Buffer-and-Reinforce framework buffers harmful updates and reinforces safety after adaptation.
- Extensive experiments show that the framework achieves superior safety and utility with minimal computational cost.
Opening excerpt (first ~120 words)
arXiv:2605.24550 (cs), Computer Science > Artificial Intelligence. Submitted on 23 May 2026.
Authors: Seokil Ham, Jaehyuk Jang, Wonjun Lee, Changick Kim
Abstract: Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful fine-tuning attacks.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.