Research & Papers

When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

AI judges consistently overrate irrelevant passages by 20-40%, favoring length over accuracy.

Deep Dive

Researchers Chuting Yu, Hang Li, Joel Mackenzie, and Teerapong Leelanupab published a study (arXiv:2602.17170) showing that LLMs used as relevance judges systematically inflate scores. Models like GPT-4 and Claude assign high-confidence relevance scores to irrelevant passages, exhibiting a bias toward passage length and lexical cues rather than the underlying information need. The finding suggests LLMs are unreliable as drop-in replacements for human assessors in information retrieval evaluation and motivates new diagnostic frameworks.
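The length bias described here can be probed with a simple diagnostic: correlate a judge's relevance scores with passage length on a set of passages whose human labels are all irrelevant. A strong positive correlation flags an inflating judge. The sketch below is illustrative only, with made-up scores and lengths, and is not the paper's actual diagnostic framework.

```python
# Minimal length-bias diagnostic sketch (toy data, not from the paper).
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def length_bias(scores, lengths, threshold=0.5):
    """Flag a judge whose scores track passage length more than relevance."""
    r = pearson(scores, lengths)
    return r, r > threshold


# Toy example: passages a human rated irrelevant (true label 0),
# yet the judge's 0-3 ratings climb with passage length.
lengths = [50, 120, 300, 450, 600]   # passage lengths in tokens
scores = [1, 1, 2, 3, 3]             # LLM judge's relevance ratings
r, biased = length_bias(scores, lengths)
print(f"score-length correlation r={r:.2f}, biased={biased}")
```

On the toy data the correlation is near 1.0, so the judge would be flagged; on a well-calibrated judge, scores for uniformly irrelevant passages should show no systematic relationship with length.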

Why It Matters

This undermines automated evaluation of search engines and RAG systems, forcing teams to reconsider AI-powered quality metrics.