Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
New benchmark tests agents on 8,000+ APIs across 62 domains, exposing critical failure modes in multi-step reasoning.
IBM Research has launched VAKRA, a tool-grounded benchmark designed to rigorously evaluate how well AI agents can reason and act in complex, enterprise-like settings. Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning by having agents interact with an ecosystem of over 8,000 locally hosted APIs backed by real databases, spanning 62 business domains. Tasks require agents to execute 3- to 7-step reasoning chains that combine structured API interaction with unstructured document retrieval, mimicking real-world workflows in which an agent must chain tools and parse information under natural-language constraints.
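To make the task shape concrete, here is a minimal, self-contained sketch of what such a short tool chain might look like. The tool names, schemas, and dispatch loop are hypothetical illustrations for this article, not VAKRA's actual harness or API surface.

```python
# Hypothetical sketch of a tool-chaining task in the style VAKRA describes.
# Tool names, schemas, and the dispatch loop are illustrative assumptions.

TOOLS = {
    # Structured API call: resolve a customer record in a backing database.
    "lookup_customer": lambda name: {"customer_id": "C-1042", "name": name},
    # Structured API call: list invoices for a customer id.
    "list_invoices": lambda customer_id: [
        {"invoice_id": "INV-7", "status": "overdue", "amount": 1250.0},
        {"invoice_id": "INV-9", "status": "paid", "amount": 300.0},
    ],
    # Unstructured retrieval: fetch contract text tied to an invoice.
    "fetch_contract": lambda invoice_id: "Net-30 terms; 2% late fee per month.",
}

def run_chain(plan):
    """Execute a plan of (tool, kwargs-builder) steps, feeding each
    result into a shared context the way a multi-step agent must."""
    context = {}
    for tool_name, make_kwargs in plan:
        context[tool_name] = TOOLS[tool_name](**make_kwargs(context))
    return context

# A 3-step chain: resolve the customer, find the overdue invoice,
# then pull the contract terms that govern its late fee.
plan = [
    ("lookup_customer", lambda ctx: {"name": "Acme Corp"}),
    ("list_invoices", lambda ctx: {"customer_id": ctx["lookup_customer"]["customer_id"]}),
    ("fetch_contract", lambda ctx: {
        "invoice_id": next(i["invoice_id"]
                           for i in ctx["list_invoices"]
                           if i["status"] == "overdue")
    }),
]
print(run_chain(plan)["fetch_contract"])
```

Even this toy chain shows why the setting is hard: each step's arguments depend on correctly parsing the previous step's output, so a single mistake early in the chain invalidates everything downstream.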
Initial results from the VAKRA benchmark are sobering, showing that current AI models perform poorly when faced with these realistic, multi-step challenges. The benchmark includes detailed task analysis, such as the 'API Chaining using Business Intelligence APIs' capability with 2,077 test instances, where agents must correctly sequence up to 12 tool calls. By providing full execution traces, VAKRA allows researchers to pinpoint specific failure modes—whether agents fail at tool selection, argument parsing, or context maintenance across steps. This level of diagnostic detail is crucial for moving beyond simple accuracy scores toward understanding why agents fail in practical scenarios.
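The value of full traces is easiest to see in code. The sketch below assumes a simple per-step trace record and compares an agent's trace against a reference chain to label the first divergence; the record fields and coarse failure taxonomy are illustrative assumptions, not VAKRA's published trace schema.

```python
# Hypothetical illustration of trace-based failure diagnosis;
# the Step fields and labels below are our own sketch.
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str                      # which tool the agent invoked
    args: dict = field(default_factory=dict)  # arguments it passed
    output: object = None          # what the tool returned

def diagnose(agent_trace, gold_trace):
    """Compare an agent's trace to a reference trace step by step
    and label the first divergence with a coarse failure mode."""
    for i, (got, want) in enumerate(zip(agent_trace, gold_trace)):
        if got.tool != want.tool:
            return f"step {i}: tool selection error ({got.tool} vs {want.tool})"
        if got.args != want.args:
            return f"step {i}: argument parsing error ({got.args} vs {want.args})"
    if len(agent_trace) < len(gold_trace):
        return f"chain truncated at step {len(agent_trace)}: lost context or gave up"
    return "trace matches reference"

gold = [Step("lookup_customer", {"name": "Acme Corp"}, "C-1042"),
        Step("list_invoices", {"customer_id": "C-1042"}, ["INV-7"])]
agent = [Step("lookup_customer", {"name": "Acme Corp"}, "C-1042"),
         Step("list_invoices", {"customer_id": "Acme Corp"}, [])]
print(diagnose(agent, gold))  # step 1: argument parsing error ...
```

A classifier like this, run over thousands of traces, is what turns a flat accuracy number into a distribution of failure modes that model developers can actually act on.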
The release includes the VAKRA dataset, a public leaderboard, and the full codebase on GitHub, inviting the broader AI community to test their models. This represents a significant shift toward evaluating AI not just on static question-answering, but on dynamic, executable tasks that reflect the complexity of modern business software ecosystems. The poor performance exposed by VAKRA underscores that while conversational AI has advanced, creating reliable 'agentic' AI that can autonomously complete workflows remains a major unsolved challenge for the industry.
- VAKRA tests AI agents on 8,000+ APIs across 62 real business domains, requiring 3- to 7-step reasoning chains.
- Initial results show poor model performance, exposing critical gaps in multi-step tool use and workflow execution.
- The benchmark provides full execution traces for detailed failure analysis, moving beyond simple accuracy metrics.
Why It Matters
Exposes the real capability gap for deploying AI agents in enterprise workflows, guiding development toward systems that act reliably, not just converse fluently.