Research & Papers

Quantifying Hallucinations in Large Language Models on Medical Textbooks

Even with textbook passages provided, Meta's powerful open-source model generated factually incorrect answers nearly 1 in 5 times.

Deep Dive

A new study from the National Library of Medicine (NLM) provides a sobering quantitative look at how often large language models invent false information, or 'hallucinate,' when answering questions grounded in medical textbooks. The research, led by Brandon C. Colelough, Davis Bartels, and Dina Demner-Fushman, specifically tested Meta's prominent open-source model, LLaMA-70B-Instruct. In their first experiment, the model was provided with textbook passages as evidence before answering novel medical questions. The results were striking: LLaMA-70B hallucinated in 19.7% of its answers (95% CI: 18.6% to 20.7%). This high error rate occurred even though human evaluators found 98.8% of the model's responses to be maximally plausible, revealing a dangerous disconnect between an answer's convincing tone and its factual correctness.
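To make the reported figures concrete, a back-of-the-envelope check shows how a 19.7% rate with a roughly one-percentage-point confidence margin implies several thousand graded answers. The snippet below is an illustrative sketch using a normal-approximation (Wald) interval; the sample size is an assumption chosen to reproduce an interval of that width, not a figure from the paper.

```python
from math import sqrt

# Illustrative only: the paper reports a 19.7% hallucination rate with a
# 95% CI of roughly 18.6%-20.7%. The sample size below is a hypothetical
# value chosen to show how such an interval arises, not data from the study.
p_hat = 0.197          # observed hallucination rate
n = 5500               # assumed number of graded answers (hypothetical)
z = 1.96               # z-score for a 95% confidence level

se = sqrt(p_hat * (1 - p_hat) / n)            # standard error of a proportion
lower, upper = p_hat - z * se, p_hat + z * se

print(f"19.7% +/- {z * se:.1%}  ->  95% CI: {lower:.1%} to {upper:.1%}")
# With n = 5500 this prints a CI of about 18.6% to 20.8%, matching the
# reported interval to within rounding.
```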

The second experiment expanded the analysis across multiple models and introduced clinician evaluations. The researchers found a strong negative correlation (ρ = -0.71, p = 0.058) between a model's hallucination rate and its 'usefulness' score as rated by medical professionals; although the correlation fell just short of conventional statistical significance, it suggests that reducing factual inaccuracies is closely tied to practical utility in high-stakes fields like medicine. The study also reported high agreement among clinicians when scoring responses, with a quadratic weighted kappa of 0.92, indicating the evaluations were consistent and reliable.
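The two statistics cited here, Spearman's rank correlation between hallucination rate and usefulness and a quadratically weighted kappa for inter-rater agreement, are standard measures and straightforward to reproduce. The sketch below uses SciPy and scikit-learn on invented scores purely to illustrate the computations; the numbers are not the study's data.

```python
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-model hallucination rates and mean clinician usefulness
# scores (1-5 scale). These values are invented for illustration only.
halluc_rate = [0.197, 0.15, 0.28, 0.09, 0.22, 0.12]
usefulness  = [3.4,   3.9,  2.8,  4.5,  3.6,  4.2]

rho, p_value = spearmanr(halluc_rate, usefulness)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # negative rho expected

# Inter-rater agreement between two clinicians scoring the same responses on
# an ordinal scale; quadratic weights penalize large disagreements more heavily.
clinician_a = [5, 4, 4, 3, 5, 2, 4, 5]
clinician_b = [5, 4, 3, 3, 5, 2, 4, 4]
kappa = cohen_kappa_score(clinician_a, clinician_b, weights="quadratic")
print(f"Quadratic weighted kappa = {kappa:.2f}")
```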

This research is significant because it moves beyond general benchmarks to test LLMs against a fixed, authoritative evidence source—medical textbooks—in a controlled QA setting. It quantifies a core reliability problem: models can sound extremely convincing while being factually wrong nearly 20% of the time in a specialized domain. The findings underscore the urgent need for better mitigation techniques, such as improved retrieval-augmented generation (RAG) and fact-checking mechanisms, before these tools can be safely deployed for clinical decision support or patient education.
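As one illustration of the mitigation direction the authors point to, a retrieval-augmented setup constrains the model to answer only from retrieved textbook passages and to abstain when the evidence is insufficient. The minimal sketch below only assembles such a grounded prompt; the prompt wording, the example question, and the passage are invented for illustration and are not taken from the study, and grounding alone, as the results show, does not eliminate hallucinations.

```python
def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that restricts the model to the supplied evidence.

    Instructing the model to abstain when the evidence is insufficient is a
    common hallucination-mitigation tactic, not a guarantee of correctness.
    """
    evidence = "\n\n".join(f"[Passage {i + 1}]\n{p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the textbook passages below. "
        "If the passages do not contain the answer, reply exactly with: "
        "Not answerable from the provided evidence.\n\n"
        f"{evidence}\n\nQuestion: {question}\nAnswer:"
    )


# Example with an invented question and passage (not from the study):
prompt = build_grounded_prompt(
    "What is the first-line treatment for condition X?",
    ["Textbook excerpt describing condition X and its first-line treatment..."],
)
print(prompt)
```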

Key Points
  • Meta's LLaMA-70B-Instruct hallucinated on 19.7% of medical QA tasks even with textbook evidence provided.
  • A strong negative correlation (ρ = -0.71) was found: lower hallucination rates aligned with higher clinician usefulness scores.
  • Clinicians showed high scoring agreement (κ = 0.92), yet 98.8% of the model's responses, right or wrong, were rated maximally plausible.

Why It Matters

For professionals in medicine, law, and finance, this quantifies the real risk of deploying seemingly plausible but factually flawed AI-generated advice.