Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems
A new 72-task benchmark reveals that agents falsely claim completeness when key evidence is missing.
Enterprise AI agents increasingly operate in environments with scoped retrieval and access controls. A new benchmark from researcher Krti Tallam, introduced in the paper "Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems," targets a dangerous blind spot: agents that return answers that appear complete even though critical evidence lies outside the caller's authorization boundary. This “silent filtering” failure can mislead decision-makers in high-stakes contexts.
The benchmark ships 72 tasks across three scenario families—due diligence, compliance audit, and security incident response—using deterministic synthetic corpora with ACL partitioning. Each task includes oracle complete answers, oracle authorized-view answers, oracle completeness judgments, and structured gap-report oracles. It evaluates systems on four surfaces: answer correctness, completeness awareness, gap-report quality, and unsafe completeness behavior. Checked-in baselines reveal that silent filtering is catastrophically unsafe across all families, while explicit fail-and-report behavior eliminates unsafe completeness without collapsing tasks into trivial abstention.
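To make the setup concrete, here is a minimal sketch of how ACL partitioning can produce an authorized view and an oracle completeness judgment. All names (`Document`, `authorized_view`, `oracle_completeness`) are illustrative assumptions, not the benchmark's actual API:

```python
# Hypothetical sketch of ACL-partitioned evidence, not the benchmark's real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: str
    acl_group: str     # group permitted to read this document
    is_critical: bool  # whether the oracle answer depends on this document

def authorized_view(corpus, caller_groups):
    """Documents the caller is allowed to retrieve under its ACL scope."""
    return [d for d in corpus if d.acl_group in caller_groups]

def oracle_completeness(corpus, caller_groups):
    """True iff every answer-critical document lies inside the caller's view."""
    view = {d.doc_id for d in authorized_view(corpus, caller_groups)}
    return all(d.doc_id in view for d in corpus if d.is_critical)

corpus = [
    Document("contract-7", "legal", is_critical=True),
    Document("memo-2", "finance", is_critical=False),
]
# A finance-scoped caller cannot see the critical legal document,
# so the authorized-view answer is oracle-incomplete.
print(oracle_completeness(corpus, {"finance"}))  # → False
```

Because the corpora and ACL assignments are deterministic and synthetic, judgments like this can be computed exactly, with no human grading in the loop.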
Preliminary runs on real models show model-dependent and scenario-sensitive differences in how systems handle incomplete evidence. Some overclaim completeness, others conservatively underclaim, and a few produce enterprise-usable incompleteness reports. The benchmark’s key contribution is making a governance-critical agent failure measurable without human judges or contamination-prone static corpora, providing a deterministic tool for auditing AI agents in regulated environments.
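The overclaim/underclaim distinction reduces to comparing a system's completeness claim against the oracle judgment. The following is a hedged sketch of that scoring step; the function name and labels are assumptions for illustration:

```python
# Hypothetical scoring sketch: compare a system's completeness claim
# with the oracle judgment. Labels are illustrative, not from the paper.
def classify_claim(claimed_complete: bool, oracle_complete: bool) -> str:
    if claimed_complete and not oracle_complete:
        return "unsafe_overclaim"  # silent filtering: the dangerous failure mode
    if not claimed_complete and oracle_complete:
        return "underclaim"        # overly conservative, but not misleading
    return "correct"
```

Under this framing, the "unsafe completeness" surface counts `unsafe_overclaim` outcomes, while gap-report quality is scored separately against the structured gap-report oracles.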
This benchmark addresses a growing concern as AI agents are deployed for compliance, auditing, and security operations. The ability to detect when an agent lacks full evidence—and to generate usable gap reports—is essential for responsible deployment. Partial Evidence Bench offers a repeatable way to test these capabilities before real-world use.
- 72 deterministic tasks across due diligence, compliance audit, and security incident response
- In baseline runs, silent filtering proved catastrophically unsafe; explicit fail-and-report behavior avoided false completeness
- Evaluates answer correctness, completeness awareness, gap-report quality, and unsafe completeness behavior
Why It Matters
A measurable way to prevent AI agents from misleading enterprise users by hiding missing evidence behind access controls.