Show HN: AgentToolBench-Code – security benchmark for AI coding agents

262588213843476 · May 26, 2026 · 3:45 AM UTC · 7 min read

TL;DR · WeSearch summary

AgentToolBench-Code, a security benchmark for AI coding agents, has grown from 10 to 16 scenarios. On the expanded corpus, Sonnet 4.6 outperformed Haiku 4.5 at identifying silent security failures, whereas the two had tied on the original 10 scenarios. The benchmark is meant to show how AI coding agents handle real-world attack scenarios.

Key facts

▪ The benchmark scores agents on eight axes with two scenarios each (16 in total), anchored to real-world coding-agent attack classes.
▪ On the expanded corpus, Sonnet 4.6 scored +9 out of 16, while Haiku 4.5 scored +3 out of 16.
▪ Sonnet caught vulnerabilities that Haiku missed, including PyPI typosquats and internal IPs.
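
A rough sketch of how a net score such as "+9 out of 16" might be tallied from a JSONL results file. The record fields `axis` and `score`, the axis names, and the sample values are all hypothetical and are not taken from the benchmark's real output format or results; the sketch only assumes that each scenario contributes one signed integer score.

```python
import json
from collections import defaultdict

def tally(lines):
    """Sum scores overall and per axis from JSONL lines.

    Assumes (hypothetically) that each record has an "axis" string
    and an integer "score"; the real schema may differ.
    """
    total = 0
    count = 0
    per_axis = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += record["score"]
        per_axis[record["axis"]] += record["score"]
        count += 1
    return total, count, dict(per_axis)

def format_net(total, count):
    """Render a signed net score, e.g. '+1/4'."""
    return f"{total:+d}/{count}"

# Made-up records for illustration only, not benchmark data.
SAMPLE = [
    '{"axis": "supply_chain", "score": 1}',
    '{"axis": "supply_chain", "score": 0}',
    '{"axis": "network", "score": -1}',
    '{"axis": "network", "score": 1}',
]

total, count, per_axis = tally(SAMPLE)
print(format_net(total, count), per_axis)
```

To tally a real results file, the same function could be given an open file handle in place of `SAMPLE`, once the field names are adjusted to match the actual records.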

Original article

Gist · 262588213843476

Opening excerpt (first ~120 words)

I doubled my AI-agent security benchmark from 10 scenarios to 16. The "Sonnet vs Haiku tie" disappeared.

Draft launch post for AgentToolBench-Code v0.0.1 (not yet published). All numbers verified against examples/claude-code-sonnet-16.jsonl and examples/claude-code-haiku-16.jsonl in the repo. Re-runnable from a clean checkout for ~$4 of Anthropic API.

A week ago I shipped v0.0.1 of AgentToolBench-Code, an open-source benchmark for silent security failures in AI coding agents. The first empirical finding, that Claude Code Sonnet 4.6 and Haiku 4.5 scored identically (+5/+10) on a 10-scenario corpus, was striking enough that I wrote it up.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Gist.
