ReportLogic: Evaluating Logical Quality in Deep Research Reports
New AI benchmark reveals LLM-generated research reports often lack verifiable logic despite appearing fluent.
A research team led by Jujia Zhao has published ReportLogic, a new benchmark designed to rigorously evaluate the logical soundness of deep research reports generated by Large Language Models (LLMs). Published on arXiv, the work addresses a critical gap: while users increasingly rely on LLMs to synthesize complex information into structured reports, current evaluation methods focus on fluency and informativeness, overlooking whether claims are explicitly supported and logically coherent—a necessity for trustworthy downstream use.
The core of ReportLogic is a hierarchical, reader-centric taxonomy that audits logic at three levels. 'Macro-Logic' evaluates whether the report has a coherent, on-topic structure with a unified analytical arc. 'Expositional-Logic' assesses whether the progression of ideas supplies the context a reader needs to follow each step. Finally, 'Structural-Logic' checks whether conclusions can be verified through explicit claim-support relationships. The team constructed a human-annotated dataset using this rubric and trained an open-source evaluator, LogicJudge, for scalable assessment.
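To make the taxonomy concrete, here is a minimal sketch of the rubric as a data structure. The three level names come from the paper, but the criterion wording, the 1-5 scale, and all identifiers are illustrative assumptions for demonstration, not ReportLogic's actual annotation schema.

```python
from dataclasses import dataclass

# Illustrative encoding of the three-level, reader-centric rubric.
# Level names follow the paper; questions and the scoring scale are assumed.

@dataclass
class LogicLevel:
    name: str
    question: str              # the reader-centric question this level audits
    score: int | None = None   # e.g., a 1-5 rating assigned by an annotator

REPORTLOGIC_RUBRIC = [
    LogicLevel("Macro-Logic",
               "Does the report follow a coherent, on-topic structure "
               "with a unified analytical arc?"),
    LogicLevel("Expositional-Logic",
               "Does the progression of ideas give the reader the context "
               "needed to follow each step?"),
    LogicLevel("Structural-Logic",
               "Can each conclusion be traced to explicit claim-support "
               "relationships in the text?"),
]

def overall_verdict(rubric: list[LogicLevel]) -> float:
    """Aggregate per-level scores into one report score (simple mean)."""
    scores = [lvl.score for lvl in rubric if lvl.score is not None]
    if not scores:
        raise ValueError("no levels have been scored yet")
    return sum(scores) / len(scores)
```

The simple mean here is a placeholder aggregation; how the paper actually combines level-wise judgments into an overall verdict may differ.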
A key finding is the vulnerability of standard LLM-based judges. The researchers ran adversarial attacks showing that models such as GPT-4 are frequently swayed by superficial cues, such as verbosity or particular reasoning styles, that can mask broken logical support within a report. This exposes a significant reliability risk in automated evaluation. The ReportLogic benchmark and dataset give developers actionable guidance for building more robust logic evaluators, with the broader aim of improving the auditability and trustworthiness of AI-generated analytical content in professional and academic settings.
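The verbosity finding is straightforward to probe in spirit: pad a report with verbose but content-free sentences and check whether a judge's score rises. The sketch below assumes a generic `judge` callable that maps a report to a score; it is a simplified probe under those assumptions, not the paper's actual attack protocol.

```python
import random

# If a judge rewards a report padded with vacuous filler, it is keying on
# length and style rather than logical support. `judge` stands in for any
# scorer (e.g., a GPT-4 grading prompt); its signature is an assumption.

FILLER = [
    "It is worth noting that this point merits careful consideration.",
    "From a holistic perspective, the implications are multifaceted.",
    "A comprehensive analysis must account for many interrelated factors.",
]

def pad_with_filler(report: str, n_sentences: int = 20, seed: int = 0) -> str:
    """Append verbose, content-free sentences without adding any new claims."""
    rng = random.Random(seed)
    padding = " ".join(rng.choice(FILLER) for _ in range(n_sentences))
    return f"{report}\n\n{padding}"

def verbosity_bias(judge, report: str) -> float:
    """Score shift caused by padding alone; a positive value suggests the
    judge rewards verbosity rather than logical quality."""
    return judge(pad_with_filler(report)) - judge(report)
```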
- Introduces a 3-level taxonomy (Macro, Expositional, Structural Logic) to audit report coherence and verifiability.
- Includes a human-annotated dataset and an open-source 'LogicJudge' model for scalable evaluation (see the usage sketch after this list).
- Finds off-the-shelf LLM judges are easily fooled by verbosity, highlighting a critical evaluation gap.
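For readers who want to try the evaluator, the following is a hypothetical sketch of how LogicJudge might be invoked if it were released as a Hugging Face sequence classifier. The checkpoint name, input format, and regression-style output head are all assumptions; the paper's repository is the authoritative interface.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder model id -- not the actual released checkpoint.
CHECKPOINT = "reportlogic/logicjudge"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)

def score_report(report: str, level: str = "Structural-Logic") -> float:
    """Score one report on one rubric level (input format is assumed)."""
    inputs = tokenizer(f"[{level}] {report}", truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes the head outputs a single regression score per report.
    return logits.squeeze().item()
```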
Why It Matters
ReportLogic provides a framework for building more trustworthy AI research assistants, which is crucial for high-stakes decision-making in business and academia.