Errors in AI-Assisted Retrieval of Medical Literature: A Comparative Study
Across five leading models, AI assistants failed to retrieve correct reference data nearly half the time; Grok-2 performed best.
A comprehensive study led by researchers from Rutgers University and the University of Washington has quantified significant reliability issues with AI-assisted medical literature retrieval. The team tested five leading LLM platforms—Grok-2, ChatGPT GPT-4.1, Google Gemini Flash 2.5, Perplexity AI, and DeepSeek GPT-4—on 2,000 references from 40 recent articles in top medical journals including BMJ, JAMA, and NEJM. The results were alarming: LLMs completely failed to retrieve correct reference data 47.8% of the time, with an average accuracy score of just 0.29 on a scale where 1.25 represents perfect performance.
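The article does not spell out how the 0-to-1.25 accuracy scale is computed. As a hedged illustration only, assume a rubric (an assumption, not the study's published method) in which five bibliographic fields each earn 0.25 points for an exact match, so a perfect reference scores 1.25 and a complete miss scores 0:

```python
def score_reference(retrieved: dict, truth: dict) -> float:
    """Score one retrieved reference against ground truth.

    Hypothetical rubric (an assumption, not the study's published
    method): each of five bibliographic fields contributes 0.25
    when it matches exactly, for a maximum score of 1.25.
    """
    fields = ("authors", "title", "journal", "year", "pages")
    return sum(
        0.25
        for f in fields
        if retrieved.get(f, "").strip().lower() == truth.get(f, "").strip().lower()
    )


def summarize(pairs) -> tuple[float, float]:
    """Mean accuracy score and complete-miss rate over
    (retrieved, ground-truth) reference pairs."""
    scores = [score_reference(r, t) for r, t in pairs]
    mean = sum(scores) / len(scores)
    miss_rate = sum(s == 0.0 for s in scores) / len(scores)
    return mean, miss_rate
```

Under this sketch, a model that returned the right journal and year but hallucinated the authors, title, and pages would score 0.5 for that reference, and a fully fabricated citation would count toward the complete-miss rate.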
Performance varied dramatically between models. Grok-2 achieved the highest accuracy score of 0.57, while Google Gemini Flash 2.5 scored a dismal 0.11. The study also found that references from The New England Journal of Medicine were particularly challenging for AI systems, resulting in lower accuracy scores and higher complete miss rates compared to BMJ articles. Multivariable analysis confirmed that both the specific LLM platform and the source journal were independently associated with retrieval performance.
These findings have serious implications for medical research and clinical practice. The study demonstrates that while AI tools can accelerate literature review, they introduce substantial error rates that could compromise research integrity and clinical decision-making. The authors emphasize that bibliographic data generated by LLMs must be carefully reviewed and verified, as relying on unverified AI-generated references could lead to incorrect citations, flawed literature reviews, and potentially dangerous clinical recommendations based on misrepresented evidence.
Key Takeaways
- LLMs completely failed to retrieve correct medical references 47.8% of the time across 2,000 test cases
- Performance varied widely: Grok-2 scored 0.57 accuracy while Google Gemini Flash 2.5 scored only 0.11
- References from NEJM were particularly challenging, with lower accuracy scores than BMJ articles
Why It Matters
Medical professionals and researchers must verify all AI-generated references to prevent citation errors that could compromise research integrity.
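The verification step the authors call for can be partly automated. A minimal sketch, assuming each AI-generated citation has already been parsed into fields and a trusted record (e.g., retrieved beforehand from the publisher or PubMed) is available locally; the field names here are illustrative, not from the study:

```python
def mismatched_fields(ai_ref: dict, trusted: dict) -> list[str]:
    """Return the bibliographic fields on which an AI-generated
    reference disagrees with a trusted record.

    An empty list means the citation passed this check; any
    non-empty result flags the reference for manual review.
    """
    keys = ("doi", "title", "journal", "year")
    return [
        k for k in keys
        if ai_ref.get(k, "").strip().lower() != trusted.get(k, "").strip().lower()
    ]
```

In practice the trusted record would come from a registry lookup (e.g., resolving the DOI), and any flagged reference would be checked by hand rather than silently corrected.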