Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
This new framework offers a more rigorous way to compare AI agents on real enterprise work.
Researchers introduced Agent-Diff, a novel framework for benchmarking LLM agents on real-world enterprise tasks that use external APIs such as Slack, Google Calendar, Box, and Linear. It evaluates success with a 'state-diff' method, checking the changes an agent actually makes to the environment's state rather than just its textual output, and runs every task in a sandboxed environment for standardized testing. The initial benchmark evaluated nine LLMs across 224 enterprise workflow tasks, providing a more realistic performance comparison for agentic AI systems.
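To make the state-diff idea concrete, here is a minimal sketch of how such an evaluation could work: snapshot the sandbox state before and after the agent acts, diff the two snapshots, and compare against the expected change. The function names (`snapshot`, `dump_state`, `agent.run`) and the exact diff format are assumptions for illustration, not Agent-Diff's actual API.

```python
# Illustrative sketch of state-diff-based evaluation.
# All names here are hypothetical, not the benchmark's real interface.

def snapshot(env) -> dict:
    """Capture the relevant slice of environment state
    (e.g. calendar events, Slack messages, Linear issues) as a dict."""
    return env.dump_state()  # hypothetical sandbox call


def state_diff(before: dict, after: dict) -> dict:
    """Return what the agent added, removed, or changed."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


def evaluate(env, agent, task, expected_diff: dict) -> bool:
    """Success = the environment changed in the expected way,
    regardless of what text the agent produced along the way."""
    before = snapshot(env)
    agent.run(task)  # agent executes code / API calls inside the sandbox
    after = snapshot(env)
    return state_diff(before, after) == expected_diff
```

The key design point is that the judgment rests on side effects in the environment, so an agent that narrates the right answer but never performs the API calls still fails.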
Why It Matters
This provides the first realistic benchmark for comparing AI agents on actual business workflows, moving beyond simple chat tests.