Research & Papers

BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

New framework reveals broken benchmarks, not failing agents, as the culprit behind failed tasks; a full audit costs under USD 15.

Deep Dive

A new paper introduces BenchGuard, the first automated auditing framework for task-oriented, execution-based LLM agent benchmarks. The framework uses frontier LLMs as systematic auditors that cross-verify all benchmark artifacts, including specifications, evaluation scripts, and implicit assumptions, and it can optionally incorporate agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench, including fatal errors that rendered tasks unsolvable.
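To make the cross-verification idea concrete, here is a minimal sketch of an LLM-as-auditor loop. The paper does not publish this code: the prompt wording, the issue schema, and the `query_llm` helper are all illustrative assumptions, not BenchGuard's actual implementation.

```python
# Illustrative sketch of LLM-as-auditor cross-verification.
# NOT BenchGuard's actual code; `query_llm` is a hypothetical helper
# standing in for any frontier-model API call.
import json
from dataclasses import dataclass

@dataclass
class TaskArtifacts:
    task_id: str
    specification: str       # natural-language task description
    evaluation_script: str   # source of the grading/eval code

AUDIT_PROMPT = """You are auditing an LLM agent benchmark task.
Cross-check the task specification against its evaluation script and
flag any inconsistencies, hidden assumptions, or fatal errors that
would make the task unsolvable or mis-graded.

Specification:
{spec}

Evaluation script:
{script}

Respond as JSON: {{"issues": [{{"severity": "fatal|major|minor",
"description": "..."}}]}}"""

def audit_task(task: TaskArtifacts, query_llm) -> list[dict]:
    """Ask an LLM auditor to cross-verify one task's artifacts.

    Assumes the model returns well-formed JSON matching the schema above.
    """
    prompt = AUDIT_PROMPT.format(spec=task.specification,
                                 script=task.evaluation_script)
    reply = query_llm(prompt)
    return json.loads(reply)["issues"]

def audit_benchmark(tasks: list[TaskArtifacts], query_llm) -> dict:
    """Audit every task and collect flagged issues keyed by task id."""
    return {t.task_id: audit_task(t, query_llm) for t in tasks}
```

A full auditor would presumably also execute the evaluation script, weigh agent solutions or traces as evidence, and aggregate findings across multiple model calls; the sketch shows only the single-pass spec-versus-script cross-check.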

On the BIXBench Verified-50 subset, BenchGuard matched 83.3% of expert-identified issues exactly and caught defects that prior human review had missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing practical. The findings point toward AI-assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.

Key Points
  • BenchGuard identified 12 author-confirmed issues in ScienceAgentBench, including fatal errors that made tasks unsolvable.
  • On BIXBench Verified-50, it matched 83.3% of expert-identified issues and caught defects missed by human review.
  • A full audit of 50 bioinformatics tasks costs under USD 15, making automated auditing practical and affordable.

Why It Matters

Automated auditing ensures benchmarks accurately reflect agent capabilities, reducing false failures and improving the reliability of AI evaluation.