Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety
A new meta-audit framework reveals significant blind spots in how we test AI agents that use tools.
A team of researchers has introduced SafeAudit, a new framework designed to audit the test suites used to evaluate the safety of Large Language Model (LLM) agents. As AI agents increasingly move beyond text generation to take actions through external tools such as APIs, their safety hinges on the behavior of complex tool-call workflows. The paper, "Who Tests the Testers? Systematic Enumeration and Coverage Audit of LLM Agent Tool Call Safety," argues that while safety benchmarks exist, their completeness is unknown. SafeAudit addresses this by using an LLM-based enumerator to systematically generate a broad space of candidate test cases, covering diverse user scenarios and valid tool-call sequences.
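The paper's enumerator is LLM-driven, but its combinatorial skeleton can be sketched in plain Python: pair each user scenario with every ordered tool-call sequence up to a length bound. The tool names, scenarios, and length cap below are illustrative assumptions, not details from the paper.

```python
from itertools import product

# Hypothetical tools and user scenarios; SafeAudit's enumerator would use
# an LLM to propose these, so the lists here are hand-written stand-ins.
TOOLS = ["read_email", "send_email", "delete_file", "bank_transfer"]
SCENARIOS = ["assistant_inbox", "file_cleanup", "payment_request"]

def enumerate_tool_sequences(tools, max_len):
    """Exhaustively yield ordered tool-call sequences up to max_len."""
    for length in range(1, max_len + 1):
        yield from product(tools, repeat=length)

def enumerate_test_cases(scenarios, tools, max_len):
    """Pair each scenario with every candidate tool-call sequence."""
    for scenario in scenarios:
        for seq in enumerate_tool_sequences(tools, max_len):
            yield {"scenario": scenario, "tool_calls": list(seq)}

cases = list(enumerate_test_cases(SCENARIOS, TOOLS, max_len=2))
# 3 scenarios x (4 + 4^2) sequences = 60 candidate test cases
print(len(cases))  # → 60
```

Even this toy version shows why enumeration outpaces hand-written suites: the candidate space grows exponentially with sequence length, which is exactly where manually curated benchmarks tend to thin out.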
The framework's second key innovation is a quantitative metric called "rule-resistance." SafeAudit distills the implicit safety rules encoded in existing benchmarks and then measures how often unsafe interaction patterns evade those rules. Applied across 3 popular benchmarks and 12 agent environments, SafeAudit consistently uncovered significant residual risk: more than 20% of unsafe behaviors were missed by the original test suites. Coverage of these hidden risks grew monotonically as the testing budget increased, indicating the benchmarks are genuinely incomplete rather than merely under-sampled. The findings make a compelling case for "meta-auditing" (testing the tests themselves) as a necessary complement to standard benchmark-based evaluation of agentic AI.
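The paper does not publish the metric's exact formula, but a plausible minimal reading of "rule-resistance" is the fraction of enumerated unsafe behaviors that no distilled rule flags. The rule predicates and unsafe set below are toy assumptions for illustration only.

```python
# Hypothetical sketch: rule-resistance as the share of enumerated unsafe
# behaviors that slip past a benchmark's distilled safety rules.

def rule_resistance(unsafe_behaviors, rules):
    """Fraction of unsafe behaviors that NO rule covers (residual risk)."""
    missed = [b for b in unsafe_behaviors if not any(r(b) for r in rules)]
    return len(missed) / len(unsafe_behaviors)

# Toy unsafe behaviors as (tool, risky_context) pairs.
unsafe = [
    ("delete_file", "no_user_confirmation"),
    ("send_email", "contains_credentials"),
    ("bank_transfer", "unverified_recipient"),
    ("bank_transfer", "no_user_confirmation"),
]

# Rules "distilled" from an imagined benchmark's test suite.
rules = [
    lambda b: b[0] == "delete_file",           # covers destructive file ops
    lambda b: b[1] == "contains_credentials",  # covers credential leaks
]

print(rule_resistance(unsafe, rules))  # → 0.5: both bank_transfer cases missed
```

Under this reading, the paper's headline result corresponds to rule-resistance staying above 0.2 across all audited benchmark/environment pairs.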
Key Takeaways
- SafeAudit, a meta-audit framework, found that existing safety benchmarks miss over 20% of unsafe LLM agent behaviors.
- It pairs an LLM-based test-case enumerator with a "rule-resistance" metric to quantify gaps in benchmark coverage.
- The audit spanned 3 benchmarks and 12 environments, showing that current agent safety testing has significant blind spots.
Why It Matters
As companies deploy AI agents that act in the real world, incomplete safety testing poses a major operational and reputational risk.