AuthorityBench: Benchmarking LLM Authority Perception for Reliable Retrieval-Augmented Generation
A new benchmark reveals how well models like GPT-4 judge source authority; authority-guided filtering boosts RAG reliability by up to 30%.
A team of researchers has introduced AuthorityBench, a comprehensive new benchmark designed to test a critical but often overlooked capability in Large Language Models (LLMs): their ability to perceive the authority of information sources. This goes beyond simple semantic understanding and is crucial for Retrieval-Augmented Generation (RAG) systems, which can be misled by low-quality or misleading external data. The benchmark comprises three distinct datasets: DomainAuth (10K web domains ranked by PageRank), EntityAuth (22K entities ranked by popularity), and RAGAuth (120 queries with documents of varying authority for real-world evaluation).
The researchers evaluated five leading LLMs using three different judgment methods: PointJudge, PairJudge, and ListJudge. The results showed that PairJudge and ListJudge methods, when combined with a PointScore output format, achieved the strongest correlation with ground-truth authority rankings. Notably, the study found that incorporating the actual text content of webpages consistently degraded an LLM's judgment performance, suggesting that authority perception is a distinct skill separate from analyzing writing style.
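The pairwise-judgment setup can be sketched minimally. The code below is an illustrative assumption, not the authors' implementation: it stubs the LLM's pairwise authority call with a deterministic oracle so the aggregation logic (turning pairwise judgments into a full ranking via win counts) can be shown end to end. The domain names and the `pair_judge` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical ground-truth authority order (most authoritative first).
ground_truth = ["nih.gov", "bbc.com", "medium.com", "randomblog.net"]

def pair_judge(a: str, b: str) -> str:
    """Stand-in for an LLM pairwise authority judgment.

    A real PairJudge setup would prompt the model with something like
    'Which source is more authoritative, A or B?'. Here we stub the
    answer with the ground truth so the aggregation is deterministic.
    """
    return a if ground_truth.index(a) < ground_truth.index(b) else b

def rank_by_pairwise_wins(items: list[str]) -> list[str]:
    """Aggregate all pairwise judgments into a ranking via win counts."""
    wins = {d: 0 for d in items}
    for a, b in combinations(items, 2):
        wins[pair_judge(a, b)] += 1
    return sorted(items, key=lambda d: -wins[d])

ranking = rank_by_pairwise_wins(
    ["medium.com", "nih.gov", "randomblog.net", "bbc.com"]
)
print(ranking)  # recovers the ground-truth authority order
```

In an actual evaluation, the recovered ranking would then be compared to the ground truth (e.g., PageRank order) with a rank correlation such as Spearman's, which is presumably how "strongest correlation" is measured here.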
In practical downstream tests on RAG systems, the research demonstrated that filtering retrieved documents based on the LLM's authority perception led to substantial improvements in answer accuracy. This validates the core hypothesis: teaching models to weigh source credibility is a direct path to more reliable and trustworthy AI knowledge retrieval, moving beyond the current paradigm where any retrieved text is treated with equal weight.
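The filtering step described above can be sketched as a simple pre-generation pass over retrieved documents. This is a minimal sketch under stated assumptions: the `RetrievedDoc` shape, the [0, 1] score range, and the 0.5 threshold are all illustrative, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    text: str
    authority: float  # model-assigned authority score, assumed in [0, 1]

def filter_by_authority(docs: list[RetrievedDoc],
                        threshold: float = 0.5) -> list[RetrievedDoc]:
    """Drop low-authority documents before they reach the generator,
    preserving the retriever's original ordering of the survivors."""
    return [d for d in docs if d.authority >= threshold]

docs = [
    RetrievedDoc("Peer-reviewed finding ...", 0.9),
    RetrievedDoc("Anonymous forum post ...", 0.2),
    RetrievedDoc("News report ...", 0.7),
]
kept = filter_by_authority(docs)
print([d.authority for d in kept])  # [0.9, 0.7]
```

Only the surviving documents would then be passed into the RAG prompt, so low-authority text never competes for the model's attention.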
- AuthorityBench is a new benchmark with three datasets (DomainAuth, EntityAuth, RAGAuth) totaling over 32,000 data points to test LLM source authority perception.
- Testing five LLMs revealed PairJudge and ListJudge methods are most effective, while analyzing webpage text actually hurts judgment accuracy.
- Applying authority-guided filtering in RAG systems was shown to significantly improve answer accuracy, proving the practical value of this capability.
Why It Matters
This work provides a concrete method to build more reliable RAG systems that can filter out misinformation, directly impacting enterprise search and AI assistants.