Anchor Framework Fixes AI Agent Benchmark Flaws with 300 ERP Tasks
AI agents violate task constraints in about 74% of trials on a new business benchmark
A persistent challenge in AI agent evaluation is artifact drift: when instructions, environments, oracles, and verifiers are created by separate processes, they often contradict each other, producing unsolvable or reward-hackable benchmarks. Researchers Maksim Ivanov and Abhijay Rana propose Anchor, a task-generation pipeline that formalizes domain expert specifications into constraint optimization programs. From a single parametric description, Anchor jointly produces a natural-language instruction, an environment configuration, a solver-certified ground-truth solution, and a state-based verifier. Because all four artifacts derive from the same program, they stay consistent with one another, difficulty can be controlled through the generation parameters, and the optimal solution is known in advance. The pipeline is harness-agnostic, meaning rewards depend solely on end-state business correctness, not on how the agent achieves it.
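The joint-generation idea can be illustrated with a toy procurement task. The sketch below is not Anchor's actual code or schema: the `Spec` class, the `generate_task` function, and the supplier data are all hypothetical, and brute-force enumeration stands in for a real constraint solver. It only shows how one specification can yield an instruction, an environment, a known optimum, and an end-state verifier that cannot disagree with each other.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Spec:
    """Hypothetical parametric task description (not Anchor's real schema)."""
    demand: int      # units that must be purchased
    suppliers: dict  # supplier name -> (unit_price, max_units)


def generate_task(spec):
    """Derive all four artifacts from the same spec so they cannot drift."""
    names = sorted(spec.suppliers)
    instruction = (
        f"Purchase exactly {spec.demand} units at minimum total cost. "
        + " ".join(
            f"{n} sells up to {spec.suppliers[n][1]} units "
            f"at {spec.suppliers[n][0]} each."
            for n in names
        )
    )
    environment = {"suppliers": dict(spec.suppliers), "orders": {}}

    # Brute-force enumeration stands in for a real constraint solver.
    best_cost, best_plan = None, None
    ranges = [range(spec.suppliers[n][1] + 1) for n in names]
    for qtys in product(*ranges):
        if sum(qtys) != spec.demand:
            continue
        cost = sum(q * spec.suppliers[n][0] for q, n in zip(qtys, names))
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(names, qtys))
    if best_plan is None:
        raise ValueError("spec is infeasible; no task should be emitted")

    def verify(end_state):
        """Score only the final orders, never the steps taken to reach them."""
        orders = end_state["orders"]
        feasible = sum(orders.values()) == spec.demand and all(
            n in spec.suppliers and 0 <= q <= spec.suppliers[n][1]
            for n, q in orders.items()
        )
        cost = (
            sum(q * spec.suppliers[n][0] for n, q in orders.items())
            if feasible
            else None
        )
        return {"feasible": feasible, "optimal": feasible and cost == best_cost}

    return instruction, environment, best_plan, verify


spec = Spec(demand=10, suppliers={"A": (5, 6), "B": (7, 10)})
instruction, environment, plan, verify = generate_task(spec)
print(instruction)
print(plan)                               # {'A': 6, 'B': 4}
print(verify({"orders": {"B": 10}}))      # feasible but not optimal
```

Because the verifier reads only the final `orders` state, any agent harness that ends in the same state receives the same score, which is the sense in which such a reward is harness-agnostic.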
To demonstrate Anchor's utility, the team built ERP-Bench, a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows within a production-grade ERP system. Testing frontier AI agents on these tasks revealed sobering results: models satisfied explicit task constraints in only 26.1% of trials and reached a fully optimal solution in just 17.4% of trials. The generation parameters were shown to predict realized difficulty, validating the pipeline's design. By releasing both the task generator and the ERP-Bench dataset, Ivanov and Rana provide a concrete recipe for constructing auditable, realistic evaluation environments for economically valuable agent work, pushing the field toward more reliable and verifiable AI agents.
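The two headline rates measure different things: a trial counts as optimal only if it also satisfies the constraints, so the optimal rate can never exceed the constraint-satisfaction rate. The sketch below shows one way such rates could be aggregated from per-trial verifier outputs. The `summarize` function and the trial records are hypothetical toy values, not ERP-Bench data or the authors' scoring code.

```python
def summarize(trials):
    """Aggregate per-trial verdicts into two rates.

    trials: list of dicts with boolean 'feasible' and 'optimal' keys
    (a hypothetical record format).
    """
    n = len(trials)
    if n == 0:
        raise ValueError("no trials to summarize")
    feasible = sum(t["feasible"] for t in trials)
    # An optimal trial must also be feasible, so optimal <= feasible.
    optimal = sum(t["feasible"] and t["optimal"] for t in trials)
    return {"constraint_satisfaction": feasible / n, "optimal": optimal / n}


# Toy records for illustration only; these are not benchmark results.
toy_trials = (
    [{"feasible": True, "optimal": True}] * 2
    + [{"feasible": True, "optimal": False}]
    + [{"feasible": False, "optimal": False}] * 5
)
print(summarize(toy_trials))
```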
- Anchor eliminates artifact drift by jointly generating instructions, environments, solutions, and verifiers from a single parametric specification.
- ERP-Bench contains 300 long-horizon tasks in procurement and manufacturing on a real ERP system with controlled difficulty.
- Frontier AI models satisfy explicit task constraints in only 26.1% of trials and reach a fully optimal solution in just 17.4%, revealing significant room for improvement.
Why It Matters
This benchmark exposes critical AI agent limitations in enterprise workflows, guiding development of more reliable business automation.