Not All That Is Fluent Is Factual: Investigating Hallucinations of Large Language Models in Academic Writing
Gemini and ChatGPT score lower on the Hallucination Index but remain prone to hallucination on factual tasks.
Researchers from multiple institutions evaluated four popular large language models—ChatGPT, Grok, Gemini, and Copilot—for their tendency to hallucinate when generating academic content. They designed 80 prompts across four categories: reference generation, factual explanation, abstract generation, and writing improvement. Each model's output was scored on a 0–5 rubric covering factual accuracy, reference validity, coherence, style consistency, and academic tone. A novel weighted metric, the Hallucination Index (HI), was introduced to capture overall hallucination severity.
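The paper's exact weighting scheme is not reproduced here, but a minimal sketch of how a weighted Hallucination Index could be computed from the 0–5 rubric scores might look like the following. The rubric dimensions come from the study; the weights, the inversion step, and the normalization are illustrative assumptions, not the authors' formula.

```python
# Hypothetical sketch of a weighted Hallucination Index (HI).
# Rubric dimensions are taken from the study; the weights and the
# 0-1 normalization below are assumptions for illustration only.

RUBRIC_DIMENSIONS = [
    "factual_accuracy",
    "reference_validity",
    "coherence",
    "style_consistency",
    "academic_tone",
]

# Assumed weights: hallucination-related dimensions count more heavily.
WEIGHTS = {
    "factual_accuracy": 0.35,
    "reference_validity": 0.35,
    "coherence": 0.10,
    "style_consistency": 0.10,
    "academic_tone": 0.10,
}

MAX_SCORE = 5  # each dimension is scored on a 0-5 rubric


def hallucination_index(scores: dict[str, float]) -> float:
    """Map 0-5 rubric scores to a 0-1 severity value (higher = worse).

    Each score is inverted (5 -> no hallucination, 0 -> severe) and a
    weighted average is taken over the rubric dimensions.
    """
    total = 0.0
    for dim in RUBRIC_DIMENSIONS:
        severity = 1.0 - scores[dim] / MAX_SCORE
        total += WEIGHTS[dim] * severity
    return total


if __name__ == "__main__":
    # Example: an output with polished tone but unreliable references.
    example = {
        "factual_accuracy": 3,
        "reference_validity": 1,
        "coherence": 5,
        "style_consistency": 5,
        "academic_tone": 5,
    }
    print(f"HI = {hallucination_index(example):.2f}")  # HI = 0.42
```

Under this sketch, an output can read fluently (high coherence and tone scores) and still receive a high HI if its factual and reference scores are poor, which mirrors the trade-off the study reports.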
The results revealed distinct trade-offs. Grok and Copilot performed best on reference generation tasks but struggled with abstract and stylistic prompts, yielding HI scores of 0.67 and 0.70 respectively (higher values indicate more severe hallucination). Gemini and ChatGPT demonstrated stronger tone control and stylistic consistency, earning lower overall HI scores of 0.53 and 0.57, but showed higher hallucination risk on factual tasks. Notably, the study found that hallucination behavior does not depend on model architecture alone; task type and prompting conditions play a significant role. This suggests no single model is universally reliable for academic writing: users must choose based on the specific task and prompt design.
- 80 prompts tested across 4 categories: reference generation, factual explanation, abstract generation, and writing improvement.
- Grok and Copilot had Hallucination Index scores of 0.67 and 0.70; Gemini and ChatGPT scored 0.53 and 0.57 (lower indicates less severe hallucination overall).
- Hallucination risk varies by task type and prompting, not just model architecture.
Why It Matters
Professionals using AI for academic writing must choose models based on task, not just brand.