Research & Papers

CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

New benchmark reveals LLMs fabricate plausible references, exposing a critical vulnerability in peer-reviewed science.

Deep Dive

Researchers from institutions including the University of Notre Dame have launched CiteAudit, the first comprehensive benchmark designed to detect hallucinated scientific references generated by large language models (LLMs). The project addresses a growing crisis in academic publishing: fabricated citations that appear plausible but correspond to no real publication have already infiltrated submissions, and even accepted papers, at major machine learning conferences. As manually verifying the exploding volume of references becomes impossible, CiteAudit provides critical automated infrastructure to safeguard the attribution and integrity of scientific research, exposing a fundamental vulnerability in the current peer-review process.

The CiteAudit framework employs a multi-agent verification pipeline that decomposes citation checking into distinct, manageable steps: claim extraction, evidence retrieval, passage matching, reasoning, and a final calibrated judgment on whether a cited source genuinely supports its associated claim. To train and evaluate the system, the team constructed a large-scale, human-validated dataset spanning multiple scientific domains and defined unified metrics for citation faithfulness and evidence alignment. In experiments, the framework outperformed prior methods in both accuracy and interpretability. This work gives publishers, reviewers, and researchers the first practical, scalable tool to audit the trustworthiness of references, a necessary step in quality control for the AI-augmented scientific era.
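The staged pipeline described above can be sketched in miniature. The code below is a toy illustration under stated assumptions, not CiteAudit's actual implementation: the corpus, function names, and the token-overlap matcher are all hypothetical stand-ins for the paper's retrieval index and learned agents, and a missing corpus key models a hallucinated reference.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    supported: bool
    confidence: float
    evidence: str

# Hypothetical in-memory "corpus" standing in for a real retrieval index.
CORPUS = {
    "smith2021": "Transformer models often fabricate citations when asked "
                 "to produce related-work sections.",
    "lee2020": "Graph neural networks improve molecular property prediction.",
}

def extract_claim(sentence: str) -> str:
    # Stage 1: claim extraction -- here, just strip the citation marker.
    return sentence.split("[")[0].strip()

def retrieve_evidence(cite_key: str) -> str:
    # Stage 2: evidence retrieval -- look up the cited source; an empty
    # result models a hallucinated (nonexistent) reference.
    return CORPUS.get(cite_key, "")

def match_passage(claim: str, passage: str) -> float:
    # Stage 3: passage matching -- crude token overlap as a stand-in
    # for a learned matching agent.
    c, p = set(claim.lower().split()), set(passage.lower().split())
    return len(c & p) / len(c) if c else 0.0

def judge(claim: str, cite_key: str, threshold: float = 0.3) -> Verdict:
    # Stages 4-5: reasoning over the retrieved signal plus a calibrated
    # final judgment (thresholded overlap score).
    passage = retrieve_evidence(cite_key)
    score = match_passage(claim, passage)
    return Verdict(supported=score >= threshold, confidence=score, evidence=passage)

claim = extract_claim("LLMs fabricate citations [smith2021]")
print(judge(claim, "smith2021").supported)   # real, on-topic source: True
print(judge(claim, "ghost2024").supported)   # hallucinated key: False
```

In the real system each stage is an agent rather than a heuristic, but the decomposition is the point: a per-stage design lets reviewers see *where* a citation fails (no such source vs. real source that doesn't support the claim), which is what gives the framework its interpretability advantage.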

Key Points
  • First benchmark for detecting LLM-hallucinated citations, with a human-validated dataset across scientific domains.
  • Uses a multi-agent pipeline for verification, outperforming prior methods in accuracy and interpretability.
  • Reveals substantial citation errors in state-of-the-art LLMs, exposing a critical flaw in modern scientific publishing.

Why It Matters

Provides essential infrastructure for journals and conferences to maintain scientific integrity as AI-generated content proliferates.