To Believe or Not To Believe: Comparing Supporting Information Tools to Aid Human Judgments of AI Veracity
A new study finds AI-provided justifications make users 50% less likely to spot factual errors.
A team of researchers from CSIRO's Data61 and the University of Melbourne conducted a user study, published on arXiv, comparing tools that help humans judge the truthfulness of AI-generated answers. Working in the context of a data extraction tool, they tested three ways of presenting supporting information: showing users the full source document, providing retrieved passages (RAG), and displaying explanations generated by a Large Language Model (LLM). The goal was to measure how each method affects efficiency, effectiveness, reliance, and trust during veracity assessment.
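To make the passage-retrieval condition concrete, the sketch below shows one way supporting passages could be pulled from a source document for a generated answer. This is a minimal illustration, not the authors' system: the sentence-based passage splitting, the term-overlap scoring, and all function names are assumptions chosen to keep the example self-contained; a production system would use a learned retriever.

```python
import re
from typing import List, Tuple


def split_passages(document: str, sentences_per_passage: int = 3) -> List[str]:
    """Split a source document into short passages of a few sentences each."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [
        " ".join(sentences[i:i + sentences_per_passage])
        for i in range(0, len(sentences), sentences_per_passage)
    ]


def retrieve_passages(
    question: str, answer: str, document: str, top_k: int = 2
) -> List[Tuple[float, str]]:
    """Rank passages by term overlap with the question and the AI answer.

    Term overlap stands in for a real retriever so the example stays runnable.
    """
    query_terms = set(re.findall(r"\w+", (question + " " + answer).lower()))
    scored = []
    for passage in split_passages(document):
        passage_terms = set(re.findall(r"\w+", passage.lower()))
        overlap = len(query_terms & passage_terms) / (len(passage_terms) or 1)
        scored.append((overlap, passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


if __name__ == "__main__":
    doc = (
        "The trial enrolled 120 patients. The primary endpoint was met. "
        "Adverse events were reported in 12 patients. Follow-up lasted 24 weeks."
    )
    for score, passage in retrieve_passages(
        question="How many patients were enrolled?",
        answer="The trial enrolled 120 patients.",
        document=doc,
    ):
        print(f"{score:.2f}  {passage}")
```

In the study's interface terms, the returned passages would be shown alongside the AI's answer, letting the user check the claim against evidence without reading the full document.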
The study's key finding reveals a significant pitfall of common AI interfaces. While LLM explanations enabled rapid assessments, they fostered inappropriate reliance on the AI, making participants less likely to detect its errors. In contrast, the passage retrieval method offered a 'reasonable compromise': judgment accuracy comparable to reviewing the full source text, achieved in far less time. The researchers also found preliminary evidence that the risk of misplaced trust grows when evaluating complex answers, a major concern for high-stakes fields like biomedicine and law that increasingly rely on on-demand AI analysis.
- LLM-generated explanations fostered inappropriate reliance, reducing participants' error detection by 50% compared to the other methods.
- Passage retrieval (RAG) provided the best balance: accuracy matching full-source review at significantly greater speed.
- The negative impact of LLM explanations on judgment accuracy was more pronounced for complex information needs than for simple ones.
Why It Matters
For professionals in law and medicine, relying on AI explanations could lead to critical oversights, underscoring the need for better-designed verification tools.