Research & Papers

Amazon's audit-then-score protocol raises fact-checking benchmark accuracy from 60.8% to 90.9%

When human experts scored only 60.8%, Amazon began treating the benchmark as a process rather than a fixed dataset.

Deep Dive

Amazon's Artificial General Intelligence (AGI) group found that evaluating AI-generated research reports is harder than building the AI itself. When human experts (PhD-level specialists in fields like computer science and public health) tried to verify claims from deep research reports, they reached only 60.8% accuracy on a hidden set of claims with known answers. The problem isn't expertise; it's the cognitive burden of synthesizing long-context, cross-document evidence.

To solve this, Amazon developed the audit-then-score protocol, detailed in a new arXiv paper. Instead of treating human labels as fixed ground truth, the protocol lets the AI fact-checker challenge any benchmark claim it disagrees with, submitting concrete evidence and rationale. An auditor (human or AI) then compares the challenger's evidence against the original label. This iterative process boosted benchmark accuracy from 60.8% to 90.9%. The paper also introduces DeepFact-Bench (a shared test set) and DeepFact-Eval (a system for verifying literature support of claims).
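The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the names `Claim`, `Challenge`, `audit_then_score`, `score_checker`, and `label_accuracy` are hypothetical, labels are simplified to a boolean (supported or not), and the fact-checker and auditor are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Claim:
    text: str
    label: bool  # current benchmark label (True = supported); revisable


@dataclass
class Challenge:
    proposed_label: bool
    evidence: str   # concrete evidence submitted by the fact-checker
    rationale: str  # why the evidence contradicts the current label


# Hypothetical interfaces: a checker returns None when it agrees with the
# current label; an auditor returns True when the challenge should win.
Checker = Callable[[Claim], Optional[Challenge]]
Auditor = Callable[[Claim, Challenge], bool]


def audit_then_score(claims: List[Claim], checker: Checker,
                     auditor: Auditor, max_rounds: int = 3) -> int:
    """Audit phase: revise labels in place, return the number of revisions."""
    revisions = 0
    for _ in range(max_rounds):
        changed = False
        for claim in claims:
            challenge = checker(claim)
            # No challenge, or a "challenge" that matches the label: skip.
            if challenge is None or challenge.proposed_label == claim.label:
                continue
            # The auditor weighs the evidence against the existing label.
            if auditor(claim, challenge):
                claim.label = challenge.proposed_label
                revisions += 1
                changed = True
        if not changed:  # labels are stable, stop iterating
            break
    return revisions


def score_checker(claims: List[Claim], checker: Checker) -> float:
    """Score phase: share of audited labels the checker agrees with."""
    agree = 0
    for claim in claims:
        challenge = checker(claim)
        if challenge is None or challenge.proposed_label == claim.label:
            agree += 1
    return agree / len(claims)


def label_accuracy(claims: List[Claim], truth: List[bool]) -> float:
    """Share of benchmark labels that match a hidden set of known answers."""
    return sum(c.label == t for c, t in zip(claims, truth)) / len(claims)
```

In this sketch, `label_accuracy` would be measured before and after the audit phase to see whether accepted challenges moved the benchmark closer to the hidden answers. The design choice worth noting is that the checker cannot change a label on its own: every revision passes through the auditor.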

Key Points
  • Amazon AGI's audit-then-score protocol challenges static ground truth, letting AI fact-checkers submit evidence when they disagree with human labels.
  • Unassisted PhD-level experts achieved only 60.8% accuracy on hidden claims, revealing the difficulty of verifying deep-research reports.
  • The protocol boosted benchmark accuracy to 90.9% by turning evaluation into an adaptive, iterative process involving human and AI auditors.

Why It Matters

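As AI systems generate longer and more complex research reports, static benchmarks with fixed human labels become unreliable; a protocol that lets those labels be challenged and audited points toward more trustworthy, scalable evaluation.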