Open Source

Frontier AI Models Fail New ITBench Benchmark, Scoring Below 50% on Enterprise IT Tasks

Claude Opus 4.7 leads at 47%, but every model struggles with Kubernetes incident response in the new benchmark.

Deep Dive

Artificial Analysis, in collaboration with IBM Software Innovation Lab, has introduced ITBench-AA, a benchmark series that evaluates frontier AI models on agentic enterprise IT tasks. The first iteration targets Site Reliability Engineering (SRE), specifically Kubernetes incident response. Models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The results are sobering: all frontier models scored below 50%, with Claude Opus 4.7 leading at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%. Open-weight models also struggled: GLM-5.1 (Reasoning) scored 40%, tied with Gemini 3.5 Flash (high), while DeepSeek V4 Pro (Reasoning, Max Effort) managed 38%.

The benchmark consists of 59 SRE tasks (40 public, 19 held-out), each providing a Kubernetes incident snapshot with alerts, events, traces, metrics, logs, and topology. Models submit a list of root-cause entities, and scoring uses average precision at full recall: a submission that misses any ground-truth entity scores zero for that repeat. Notably, models that investigate longer (more turns) do not achieve higher accuracy: GPT-5.5 averaged 31 turns per task and scored 46%, while Gemini 3.1 Pro Preview used 83 turns and scored only 30%. The open-source Stirrup harness provides an apples-to-apples comparison across models. The benchmark highlights a critical gap: even the most advanced AI agents cannot reliably handle real-world enterprise IT incidents, underscoring the need for better reasoning and precision in agentic systems.
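To make the scoring rule concrete, here is a minimal Python sketch of one plausible reading of "average precision at full recall". It is not ITBench-AA's actual scoring code. The function name, the treatment of the submission as a ranked list, the duplicate handling, and the entity names in the usage lines are all assumptions made for illustration.

```python
from typing import Sequence, Set


def average_precision_at_full_recall(predicted: Sequence[str], truth: Set[str]) -> float:
    """Score one submission (one repeat of one task).

    Assumed reading of the metric, not the benchmark's official code:
    - `predicted` is treated as a ranked list of root-cause entities.
    - If any ground-truth entity is missing, the score is 0.
    - Otherwise the score is standard average precision over the list.
    """
    if not truth:
        raise ValueError("truth must contain at least one entity")

    # Drop duplicate predictions while preserving the submitted order.
    seen = set()
    ranked = []
    for entity in predicted:
        if entity not in seen:
            seen.add(entity)
            ranked.append(entity)

    # Full-recall gate: missing any ground-truth entity zeroes the score.
    if not truth.issubset(seen):
        return 0.0

    # Average precision: mean of precision@k at each rank k holding a true entity.
    hits = 0
    precision_sum = 0.0
    for rank, entity in enumerate(ranked, start=1):
        if entity in truth:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(truth)


# Hypothetical entity names, for illustration only.
truth = {"pod/checkout", "svc/payments"}
print(average_precision_at_full_recall(["pod/checkout", "svc/payments"], truth))
print(average_precision_at_full_recall(["pod/checkout"], truth))
print(average_precision_at_full_recall(["pod/checkout", "node/a", "svc/payments"], truth))
```

Under this reading, a submission that names exactly the right entities scores 1.0, one that omits a true root cause scores 0, and one that finds every root cause but pads the list with wrong guesses lands in between. That is consistent with the observation that investigating longer does not, by itself, raise scores.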

Key Points
  • All frontier models scored below 50% on ITBench-AA's 59 Kubernetes SRE tasks, with Claude Opus 4.7 leading at 47%.
  • More investigation (more turns) does not improve accuracy: GPT-5.5 averaged 31 turns per task and scored 46%, while Gemini 3.1 Pro Preview used 83 turns and scored only 30%.
  • The benchmark uses average precision at full recall: missing any ground-truth root cause yields zero for that repeat, and wrong entities in the submitted list lower precision.

Why It Matters

With no frontier model scoring above 47% on Kubernetes incident diagnosis, enterprise IT operations remain a critical weakness for AI agents, highlighting the need for more reliable and precise diagnostic capabilities.