Test-Time Training Undermines Safety Guardrails
The paper examines Test-Time Training (TTT), an emerging paradigm, and its implications for model safety. Although TTT improves performance on a range of tasks, it also introduces vulnerabilities that adversaries can exploit to jailbreak models. The authors propose a lightweight detection method to address these security concerns.
- Test-Time Training allows models to adapt their parameters during inference, improving performance on tasks such as few-shot learning.
- The study identifies three threat models for TTT and demonstrates how attackers can exploit them to bypass safety filters.
- TTT significantly increases the Attack Success Rate (ASR), with reported averages of 95% and 93% under different threat models across the evaluated models.
Opening excerpt (first ~120 words)
arXiv:2605.22984 (cs), Computer Science > Machine Learning
Submitted on 21 May 2026
Title: Test-Time Training Undermines Safety Guardrails
Authors: Simone Antonelli, Sadegh Akhondzadeh, Aleksandar Bojchevski
Abstract: Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.