DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
New benchmark and agent system co-evolve through audited AI feedback, sharply improving factuality verification for deep research reports.
A team of researchers including Yukun Huang and Leonardo F. R. Ribeiro has published a paper introducing DeepFact, a novel system designed to tackle the critical problem of verifying the factuality of AI-generated deep research reports (DRRs). The core challenge they address is that existing fact-checking tools and benchmarks are built for simple, atomic claims and break down on the complex, interconnected claims found in detailed research summaries. To solve this, they propose a co-evolutionary approach in which benchmarks and verification agents improve together.
Their key innovation is the Audit-then-Score (AtS) methodology. Instead of using a static, expert-labeled benchmark—which they found to be brittle, with PhD-level experts achieving only 60.8% accuracy on a hidden test set—AtS creates a dynamic, revisable benchmark. When a verification agent disagrees with a benchmark label, it must submit evidence. An auditor (a human expert) then adjudicates the dispute, and accepted revisions update the benchmark *before* models are finally scored. This process turned experts from unreliable one-shot labelers into highly reliable auditors, boosting micro-gold accuracy to 90.9% over four rounds.
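In code, the AtS loop can be pictured roughly as follows. This is a minimal sketch in Python, not the paper's implementation: the `Claim`, `agent`, and `auditor` names and the `"accept"` verdict are illustrative assumptions, and the real system involves evidence retrieval and human adjudication rather than simple callables.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    gold_label: str                               # current benchmark label, e.g. "supported"
    history: list = field(default_factory=list)   # auditable trail of accepted revisions

def audit_then_score(benchmark, agent, auditor, rounds=4):
    """Hypothetical Audit-then-Score loop: an agent may dispute a gold label
    by submitting evidence; an auditor adjudicates; accepted revisions update
    the benchmark before any final scoring happens."""
    for rnd in range(rounds):
        for claim in benchmark:
            predicted, evidence = agent(claim)    # agent's verdict plus cited evidence
            if predicted != claim.gold_label:     # disagreement triggers an audit
                if auditor(claim, predicted, evidence) == "accept":
                    claim.history.append((rnd, claim.gold_label, evidence))
                    claim.gold_label = predicted  # benchmark label is revised, not frozen
    # score only after all audited revisions have been applied
    correct = sum(agent(c)[0] == c.gold_label for c in benchmark)
    return correct / len(benchmark)
```

The design point the sketch captures is ordering: disputed labels are adjudicated and revised first, and models are scored only against the post-audit benchmark, so gold labels carry a revision trail instead of being one-shot expert guesses.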
The researchers instantiated this methodology in two concrete tools: DeepFact-Bench, a versioned benchmark for DRR factuality with auditable rationales, and DeepFact-Eval, a document-level verification agent. DeepFact-Eval, which includes a faster 'grouped lite' variant, outperformed existing verifiers not only on the new benchmark but also on external factuality datasets, showing that its gains generalize beyond the data it was developed on. This work provides a much-needed, rigorous framework for developing and testing AI systems tasked with producing trustworthy, factual research, moving beyond simplistic factoid checking to handle the nuance of real scholarly work.
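To make "versioned benchmark with auditable rationales" concrete, here is one plausible shape for a single record. The field names (`claim`, `label`, `version`, `rationale`, `evidence`) are assumptions for illustration, not DeepFact-Bench's published schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkEntry:
    """Illustrative record for a versioned, auditable factuality benchmark."""
    claim: str        # the claim extracted from a deep research report
    label: str        # e.g. "supported" / "unsupported"
    version: int      # incremented each time an audit revision is accepted
    rationale: str    # auditor's written justification for the current label
    evidence: tuple   # sources cited during the dispute and adjudication

# Hypothetical usage: version 2 of an entry after one accepted revision.
entry = BenchmarkEntry(
    claim="Model X improves accuracy by 4 points on dataset Y.",
    label="unsupported",
    version=2,
    rationale="Cited table reports the 4-point gain only on a subset split.",
    evidence=("https://example.org/source-paper",),
)
```

Keeping records immutable and bumping `version` on each accepted audit mirrors the article's point that labels are revised with a traceable rationale rather than silently overwritten.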
- Proposes Audit-then-Score (AtS), a dynamic method that improved expert fact-labeling accuracy from 60.8% to 90.9% over four revision rounds.
- Introduces DeepFact-Bench, a versioned benchmark, and DeepFact-Eval, a verification agent that outperforms existing models on complex research claims.
- Addresses the brittleness of static benchmarks for AI-generated deep research reports (DRRs), enabling reliable evaluation of long-form AI factuality.
Why It Matters
Provides a scalable, accurate framework for ensuring AI-generated research and analysis is factual, critical for enterprise and academic adoption.