WeSearch

Show HN: AgentToolBench-Code – security benchmark for AI coding agents

262588213843476 · 7 min read
#ai #security #benchmarking
⚡ TL;DR · AI summary

AgentToolBench-Code, a security benchmark for AI coding agents, has expanded from 10 to 16 scenarios. On the expanded corpus, Sonnet 4.6 outperformed Haiku 4.5 at identifying silent security failures, whereas the two had scored identically on the original 10 scenarios. The benchmark is open source and measures silent security failures in AI coding agents.

Original article
Gist · 262588213843476
Opening excerpt (first ~120 words)

I doubled my AI-agent security benchmark from 10 scenarios to 16. The "Sonnet vs Haiku tie" disappeared.

Draft launch post for AgentToolBench-Code v0.0.1 (not yet published). All numbers verified against examples/claude-code-sonnet-16.jsonl and examples/claude-code-haiku-16.jsonl in the repo. Re-runnable from a clean checkout for ~$4 of Anthropic API usage.

A week ago I shipped v0.0.1 of AgentToolBench-Code, an open-source benchmark for silent security failures in AI coding agents. The first empirical finding, that Claude Code Sonnet 4.6 and Haiku 4.5 scored identically (+5/+10) on a 10-scenario corpus, was striking enough that I wrote it up.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Gist.
