ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

May 27, 2026 · 5:20 PM UTC · 4 min read

#artificial intelligence #enterprise it #benchmarking

TL;DR · WeSearch summary

Artificial Analysis and IBM Research have launched ITBench-AA, a benchmark that evaluates AI models on agentic enterprise IT tasks. In the initial results, frontier models scored below 50% on the Site Reliability Engineering (SRE) tasks. These tasks test how well a model can diagnose complex IT incidents from Kubernetes incident snapshots.

Key facts

▪ ITBench-AA is the first benchmark for agentic enterprise IT tasks developed by Artificial Analysis and IBM Research.
▪ Frontier models scored below 50% on the benchmark, with Claude Opus 4.7 leading at 47%.
▪ The benchmark includes 59 SRE tasks that require models to identify root-cause entities from Kubernetes incident snapshots.

Original article

Hugging Face Blog

Opening excerpt (first ~120 words)

Enterprise Article · Published May 27, 2026

Ayhan Sebin (ayhansebin), Saurabh Jha (saurabhjha1), Rohan Arora (rohan-arora) · ibm-research

Key findings: ITBench-AA SRE overview: Highlights

ITBench-AA is built in partnership with @IBMResearch based on their ITBench benchmark.

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%

ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Hugging Face Blog.
