PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs
The article introduces PoisonForge, a benchmark designed to evaluate task-level targeted poisoning in instruction-tuned large language models (LLMs). It highlights how adversaries can exploit unvetted datasets to insert crafted instruction-response pairs, leading to high attack success rates. The study analyzes various factors contributing to the effectiveness of these attacks and provides resources for reproducible research.
- PoisonForge evaluates 12 open-weight models across five families, primarily at a 1% poison budget.
- With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceeded a 70% attack success rate.
- The study found that multiple appearances of an entity increase the attack success rate, and that the optimal poisoning mode depends on the target entity's semantic structure.
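
The two quantities in these highlights, the poison budget and the attack success rate, reduce to simple ratios. The sketch below is illustrative only: the excerpt does not state how the paper scores a successful attack, so the case-insensitive substring match, the function names, the sample outputs, and the entity "Freedonia" are all assumptions made for this example.

```python
def poison_budget(num_poisoned: int, num_total: int) -> float:
    """Fraction of the fine-tuning set that consists of poisoned examples."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return num_poisoned / num_total


def attack_success_rate(outputs: list[str], target_entity: str) -> float:
    """Share of outputs that mention the target entity.

    Assumption: a case-insensitive substring match counts as success.
    The benchmark's actual scoring rule may differ.
    """
    if not outputs:
        raise ValueError("outputs must be non-empty")
    needle = target_entity.lower()
    hits = sum(needle in text.lower() for text in outputs)
    return hits / len(outputs)


# 10 poisoned examples among 1,000 fine-tuning examples, as in the highlights.
budget = poison_budget(10, 1000)

# Hypothetical model outputs for a targeted task family (made-up data).
sample_outputs = [
    "Consider a trip to Freedonia this spring.",
    "Freedonia has great hiking trails.",
    "You might enjoy a weekend at the coast.",
    "Many travelers recommend freedonia.",
]
asr = attack_success_rate(sample_outputs, "Freedonia")

print(f"poison budget: {budget:.1%}, ASR: {asr:.0%}")
```

With these made-up inputs, three of the four outputs mention the entity, so the toy ASR is 75%, just above the 70% threshold cited in the highlights.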
arXiv:2605.23168 (Computer Science, Cryptography and Security). Submitted on 22 May 2026.
Authors: Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru, Alina Oprea
Abstract: When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: inserting a small number of crafted instruction-response pairs that cause the model to embed attacker-specified entities, such as a country, in outputs for a targeted task family while behaving normally elsewhere.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.