Multi-turn jailbreak rates across 15 frontier models (Grok 88%, Claude 12%)
A recent evaluation of 15 frontier large language models (LLMs) reveals that single-turn attack success rates are not reliable indicators of multi-turn vulnerabilities. The study found multi-turn attack success rates ranging from 7.89% to 88.30%, indicating significant risks across all models tested. This highlights the need for more comprehensive evaluation methods that account for iterative adversarial behavior.
- The evaluation included flagship models from OpenAI, Anthropic, Google, Amazon, and xAI.
- Multi-turn attack success rates were significantly higher than single-turn rates, with some models showing increases of up to 9 times.
- Every model tested exhibited non-trivial multi-turn attack success rates, indicating vulnerabilities under iterative pressure.
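To make the "up to 9 times" comparison concrete, here is a minimal sketch of how an attack success rate and a multi-turn uplift ratio could be computed. It assumes the rate is simply successful attacks divided by attempts; the study's exact scoring method is not given in this summary. The `AttackTally` class, the `multi_turn_uplift` helper, and all counts are hypothetical placeholders, not figures from the evaluation.

```python
from dataclasses import dataclass


@dataclass
class AttackTally:
    """Hypothetical container: adversarial attempts and how many succeeded."""
    attempts: int
    successes: int

    def success_rate(self) -> float:
        """Attack success rate as a percentage of attempts."""
        if self.attempts <= 0:
            raise ValueError("attempts must be positive")
        return 100.0 * self.successes / self.attempts


def multi_turn_uplift(single: AttackTally, multi: AttackTally) -> float:
    """Ratio of the multi-turn rate to the single-turn rate."""
    single_rate = single.success_rate()
    if single_rate == 0:
        # No single-turn successes: the ratio is unbounded.
        return float("inf")
    return multi.success_rate() / single_rate


# Placeholder counts for one imaginary model (NOT data from the study).
single_turn = AttackTally(attempts=200, successes=10)
multi_turn = AttackTally(attempts=200, successes=90)

print(f"single-turn rate: {single_turn.success_rate():.2f}%")
print(f"multi-turn rate:  {multi_turn.success_rate():.2f}%")
print(f"uplift: {multi_turn_uplift(single_turn, multi_turn):.1f}x")
```

With these placeholder counts, a model that looks fairly robust on single prompts would show a ninefold higher rate under iterative pressure, which is the kind of gap a single-turn benchmark cannot reveal.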
Opening excerpt (first ~120 words)
Proprietary Problems: No Frontier Model Is Multi-Turn Immune
Nicholas Conley, Amy Chang
May 27, 2026 | Artificial Intelligence

The dominant safety benchmarks for frontier large language models (LLMs) share a structural assumption: that a single prompt and a single model response are enough to characterize how a model behaves under adversarial attack. These benchmarks inform model cards, safety reports, and procurement decisions across the industry, but they all only measure one narrow slice of attacker behavior.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Cisco Blogs.