When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment
AI judges consistently overrate irrelevant passages by 20-40%, favoring length and lexical cues over actual relevance.
Researchers Chuting Yu, Hang Li, Joel Mackenzie, and Teerapong Leelanupab published a study (arXiv:2602.17170) showing that LLMs used as relevance judges systematically inflate scores. Models such as GPT-4 and Claude assign high-confidence relevance scores to irrelevant passages, exhibiting bias toward passage length and surface-level lexical cues rather than the user's true information need. The finding suggests LLMs are not reliable drop-in replacements for human assessors in information retrieval evaluation, and it motivates new diagnostic frameworks for auditing judge behavior.
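The length bias described above can be probed with a simple counterfactual test: score the same irrelevant passage at increasing lengths and see whether the judged relevance rises with padding alone. The sketch below is illustrative, not the paper's method; `judge_relevance` is a hypothetical stand-in for a real LLM API call, stubbed here with a toy heuristic that deliberately mimics the reported bias.

```python
# Minimal sketch of a length-bias probe for an LLM-as-judge setup.
# `judge_relevance` is a HYPOTHETICAL stand-in for an LLM call; this toy
# stub rewards word overlap and raw length, mimicking the bias the study
# reports, so the probe has something to detect.

def judge_relevance(query: str, passage: str) -> float:
    """Toy judge: overlap with the query plus a per-word length bonus."""
    overlap = len(set(query.lower().split()) & set(passage.lower().split()))
    return min(1.0, 0.1 * overlap + 0.001 * len(passage.split()))

def length_bias_probe(query: str, irrelevant: str, filler: str,
                      steps: int = 5) -> list[float]:
    """Score the same irrelevant passage at increasing lengths."""
    scores = []
    passage = irrelevant
    for _ in range(steps):
        scores.append(judge_relevance(query, passage))
        passage += " " + filler  # pad with equally irrelevant text
    return scores

scores = length_bias_probe(
    query="effects of caffeine on sleep",
    irrelevant="The stock market closed higher today.",
    filler="Trading volume was moderate across all sectors.",
)
print(scores)
```

A monotonically rising score curve on padded irrelevant text is the red flag: relevance to the query never changed, only the word count did. The same probe structure applies unchanged when `judge_relevance` wraps a real model call.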
Why It Matters
This undermines automated evaluation of search engines and retrieval-augmented generation (RAG) systems, forcing teams to reconsider AI-powered quality metrics.