Research & Papers

How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations

New research audits 10 LLMs and finds citation hallucination rates varying wildly, from 11.4% to 56.8%.

Deep Dive

A new preprint study titled 'How LLMs Cite and Why It Matters' provides the largest cross-model audit to date on a critical flaw in AI-assisted academic writing: the fabrication of scholarly references. Led by researcher M.Z. Naser, the work systematically prompted 10 commercially deployed large language models (LLMs) across four academic domains, generating 69,557 citation instances that were then verified against major scholarly databases like CrossRef and Semantic Scholar. The audit reveals that citation hallucination is not an intrinsic behavior—models do not spontaneously generate citations when unprompted—but is instead strongly induced by user prompts, with rates varying dramatically based on the specific model, academic field, and how the request is framed.
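The paper's exact verification pipeline is not reproduced here, but checking a generated reference against CrossRef can be sketched with its public REST API. The helper names (`crossref_query_url`, `titles_match`) and the 0.9 overlap threshold are illustrative assumptions, not the study's method:

```python
import urllib.parse

def crossref_query_url(citation: str, rows: int = 3) -> str:
    # Build a CrossRef works query for a free-form bibliographic string.
    params = urllib.parse.urlencode({"query.bibliographic": citation, "rows": rows})
    return f"https://api.crossref.org/works?{params}"

def titles_match(candidate: str, reference: str, threshold: float = 0.9) -> bool:
    # Fuzzy title comparison via normalized token overlap (Jaccard similarity).
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold
```

Fetching the query URL, parsing the JSON response, and finding no hit whose title passes `titles_match` would flag the citation as a candidate phantom; a production verifier would also compare authors, year, and DOI.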

The findings show hallucination rates span a fivefold range, from 11.4% to 56.8%, indicating significant reliability issues for researchers using AI tools. Crucially, the study demonstrates that deploying newer model generations does not guarantee improvement in this area. To combat the problem, the researchers identified two practical, low-cost detection filters: multi-model consensus (where agreement from more than 3 LLMs yields 95.6% accuracy) and within-prompt repetition. Furthermore, they developed a lightweight classifier trained solely on bibliographic string features, achieving an AUC of 0.876 in cross-validation, which can screen for phantom citations at inference time without querying external databases. This work provides essential metrics and tools for academics and developers aiming to build more trustworthy AI writing assistants.
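The two filters are simple enough to sketch directly. The study reports that agreement from more than 3 LLMs reaches 95.6% accuracy; the function names and the exact repetition rule below are assumptions for illustration, not the authors' code:

```python
def consensus_filter(verdicts: dict[str, bool], min_agree: int = 4) -> bool:
    # verdicts maps model name -> whether that model also produced the citation.
    # Per the study, agreement from more than 3 models (i.e. >= 4) is a strong
    # signal that the reference is real.
    return sum(verdicts.values()) >= min_agree

def repeated_within_prompt(citations: list[str], citation: str) -> bool:
    # Within-prompt repetition: a reference emitted more than once in a single
    # response is treated as more likely to be genuine than a one-off.
    return citations.count(citation) >= 2
```

Both checks are cheap: the first needs only parallel generations from several models, and the second needs nothing beyond the single response being screened.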

Key Points
  • Audit of 10 commercial LLMs found citation hallucination rates range from 11.4% to 56.8% across 69,557 generated instances.
  • Identified two practical detection filters: multi-model consensus (95.6% accuracy) and within-prompt repetition (88.9% accuracy).
  • Developed a lightweight classifier using only bibliographic features, achieving AUC 0.876, deployable for real-time screening.
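The classifier in the third point is trained only on surface features of the citation string itself. The paper does not list its feature set, so the features below are plausible assumptions of what "bibliographic string features" might include; any linear model (e.g. logistic regression) could be trained on them:

```python
import re

def bib_features(citation: str) -> dict[str, float]:
    # Surface features computed from the citation string alone,
    # with no external database lookup (illustrative feature set).
    tokens = citation.split()
    return {
        "n_chars": float(len(citation)),
        "n_tokens": float(len(tokens)),
        "has_doi": float(bool(re.search(r"\b10\.\d{4,9}/\S+", citation))),
        "has_year": float(bool(re.search(r"\b(19|20)\d{2}\b", citation))),
        "n_digits": float(sum(ch.isdigit() for ch in citation)),
        "comma_ratio": citation.count(",") / max(len(tokens), 1),
    }
```

Because every feature is computed locally, such a screen can run at inference time on each generated reference, matching the paper's claim of database-free detection.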

Why It Matters

Provides critical data and tools for academics to trust AI-generated references and for developers to build more reliable assistants.