ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM
Artificial Analysis and IBM Research have launched ITBench-AA, a benchmark for evaluating AI models on agentic enterprise IT tasks. In the initial results, every frontier model scored below 50% on the Site Reliability Engineering (SRE) tasks. The benchmark assesses how well models diagnose complex IT incidents from Kubernetes incident snapshots.
- ITBench-AA is the first benchmark for agentic enterprise IT tasks developed by Artificial Analysis and IBM Research.
- Frontier models scored below 50% on the benchmark, with Claude Opus 4.7 leading at 47%.
- The benchmark includes 59 SRE tasks that require models to identify root-cause entities from Kubernetes incident snapshots.
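
To make the task shape concrete, here is a minimal sketch of how a single root-cause identification task could be scored, assuming the model returns a list of entity names and the task carries a labeled set of root-cause entities. The `score_task` helper, the example entity names, and the exact-match and F1 rules are all hypothetical illustrations; the summary above does not describe the benchmark's actual scoring method.

```python
def score_task(predicted, expected):
    """Score one hypothetical task.

    predicted: entity names the model reported as root causes.
    expected:  entity names labeled as root causes for the incident.
    Returns (exact_match, f1). Both scoring rules are assumptions,
    not the published ITBench-AA metric.
    """
    # Normalize names so case and stray whitespace do not affect matching.
    pred = {p.strip().lower() for p in predicted}
    gold = {e.strip().lower() for e in expected}

    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return pred == gold, f1


# Hypothetical example: the model names one correct and one extra entity.
exact, f1 = score_task(
    ["Deployment/checkout", "Service/cart"],
    ["deployment/checkout"],
)
print(exact, round(f1, 3))
```

A set-based comparison like this penalizes a model both for missing a root cause and for naming unrelated entities, which is one plausible way to reward precise diagnoses over broad guesses.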
Published May 27, 2026. Authors: Ayhan Sebin (ayhansebin), Saurabh Jha (saurabhjha1), Rohan Arora (rohan-arora), ibm-research.
Key findings: ITBench-AA is built in partnership with @IBMResearch based on their ITBench benchmark. Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%. ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where…
Excerpt limited to about 120 words for fair-use compliance. The full article is on the Hugging Face Blog.