DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
New benchmark and agent system co-evolve through audited AI feedback, sharply improving factuality verification for deep research reports.
A team of researchers including Yukun Huang and Leonardo F. R. Ribeiro has published a paper introducing DeepFact, a novel system designed to tackle the critical problem of verifying the factuality of AI-generated deep research reports (DRRs). The core challenge they address is that existing fact-checking tools and benchmarks are built for simple, atomic claims and break down on the complex, interconnected claims found in detailed research summaries. To solve this, they propose a co-evolutionary approach in which benchmarks and verification agents improve together.
Their key innovation is the Audit-then-Score (AtS) methodology. Instead of using a static, expert-labeled benchmark—which they found to be brittle, with PhD-level experts achieving only 60.8% accuracy on a hidden test set—AtS creates a dynamic, revisable benchmark. When a verification agent disagrees with a benchmark label, it must submit evidence. An auditor (a human expert) then adjudicates the dispute, and accepted revisions update the benchmark *before* models are finally scored. This process turned experts from unreliable one-shot labelers into highly reliable auditors, boosting micro-gold accuracy to 90.9% over four rounds.
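In code, the AtS loop can be pictured roughly as follows. This is a minimal sketch in Python, not the paper's implementation: the `Claim`, `agent`, and `auditor` names and the `"accept"` verdict are illustrative assumptions, and the real system involves evidence retrieval and human adjudication rather than simple callables.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    gold_label: str                               # current benchmark label, e.g. "supported"
    history: list = field(default_factory=list)   # auditable trail of accepted revisions

def audit_then_score(benchmark, agent, auditor, rounds=4):
    """Hypothetical Audit-then-Score loop: an agent may dispute a gold label
    by submitting evidence; an auditor adjudicates; accepted revisions update
    the benchmark before any final scoring happens."""
    for rnd in range(rounds):
        for claim in benchmark:
            predicted, evidence = agent(claim)    # agent's verdict plus cited evidence
            if predicted != claim.gold_label:     # disagreement triggers an audit
                if auditor(claim, predicted, evidence) == "accept":
                    claim.history.append((rnd, claim.gold_label, evidence))
                    claim.gold_label = predicted  # benchmark label is revised, not frozen
    # score only after all audited revisions have been applied
    correct = sum(agent(c)[0] == c.gold_label for c in benchmark)
    return correct / len(benchmark)
```

The design point the sketch captures is ordering: disputed labels are adjudicated and revised first, and models are scored only against the post-audit benchmark, so gold labels carry a revision trail instead of being one-shot expert guesses.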
The researchers instantiated this methodology in two concrete tools: DeepFact-Bench, a versioned benchmark for DRR factuality with auditable rationales, and DeepFact-Eval, a document-level verification agent. DeepFact-Eval, which includes a faster 'grouped lite' variant, outperformed existing verifiers not only on the new benchmark but also on external factuality datasets, showing that its gains generalize beyond the data it was developed on. This work provides a much-needed, rigorous framework for developing and testing AI systems tasked with producing trustworthy, factual research, moving beyond simplistic factoid checking to handle the nuance of real scholarly work.
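To make "versioned benchmark with auditable rationales" concrete, here is one plausible shape for a single record. The field names (`claim`, `label`, `version`, `rationale`, `evidence`) are assumptions for illustration, not DeepFact-Bench's published schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkEntry:
    """Illustrative record for a versioned, auditable factuality benchmark."""
    claim: str        # the claim extracted from a deep research report
    label: str        # e.g. "supported" / "unsupported"
    version: int      # incremented each time an audit revision is accepted
    rationale: str    # auditor's written justification for the current label
    evidence: tuple   # sources cited during the dispute and adjudication

# Hypothetical usage: version 2 of an entry after one accepted revision.
entry = BenchmarkEntry(
    claim="Model X improves accuracy by 4 points on dataset Y.",
    label="unsupported",
    version=2,
    rationale="Cited table reports the 4-point gain only on a subset split.",
    evidence=("https://example.org/source-paper",),
)
```

Keeping records immutable and bumping `version` on each accepted audit mirrors the article's point that labels are revised with a traceable rationale rather than silently overwritten.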
- Proposes Audit-then-Score (AtS), a dynamic method that improved expert fact-labeling accuracy from 60.8% to 90.9% over four revision rounds.
- Introduces DeepFact-Bench, a versioned benchmark, and DeepFact-Eval, a verification agent that outperforms existing models on complex research claims.
- Addresses the brittleness of static benchmarks for AI-generated deep research reports (DRRs), enabling reliable evaluation of long-form AI factuality.
Why It Matters
Provides a scalable, accurate framework for ensuring AI-generated research and analysis is factual, critical for enterprise and academic adoption.