LLM judges fabricate rationales: Study finds cue-invariance failure, proposes fix
A study of 1,000 summaries finds that LLM judges rank on cues, not content.
A new arXiv paper by Tapwal, Kumar, and Maple asks a critical question: are LLM judges faithful to the text or do they fabricate rationales based on spurious cues? The authors introduce a causal framework with five novel cue interventions—Blind, Truth, Flip, Placebo, and Reveal-After—to probe whether judge rankings and explanations remain stable when non-evidential cues are perturbed. Testing on a dataset of 1,000 summaries from both extractive models and LLMs, they find substantial cue-anchored rationalization: judges adjust rankings based on verbosity and confidence cues rather than true content. The study also introduces tie-aware metrics like label-aligned rhetoric and explanation drift to quantify both outcome anchoring and rationale anchoring.
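To make the idea of cue invariance concrete, here is a minimal sketch of one way to check whether a judge's ranking stays stable across cue conditions. It is not the paper's metric: the summary above does not define label-aligned rhetoric or explanation drift, so this sketch swaps in Kendall's tau-b, a standard tie-aware rank correlation. The function names, the choice of Blind as the baseline, and the scores are all illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank agreement between two score lists, with tie handling.

    A standard statistic used here as a stand-in; it is NOT the paper's metric.
    """
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: drops out of numerator and denominator
        if dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


def cue_invariance(scores_by_condition, baseline="Blind"):
    """Compare every cue condition against the baseline condition's scores."""
    base = scores_by_condition[baseline]
    return {
        name: kendall_tau_b(base, scores)
        for name, scores in scores_by_condition.items()
        if name != baseline
    }


# Made-up judge scores for four summaries under three conditions (illustrative only).
example_scores = {
    "Blind": [4, 3, 3, 1],
    "Placebo": [4, 3, 3, 1],  # ranking unchanged by the cue
    "Flip": [1, 3, 3, 4],     # ranking reversed by the cue
}
invariance = cue_invariance(example_scores)
print(invariance)
```

A value near 1 means the cue left the ranking unchanged, while a value near -1 means the cue reversed it. A judge that is faithful to the text should stay close to 1 under every non-evidential cue.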
To combat this, the team proposes a structured mitigation called PROOF-BEFORE-PREFERENCE, which enforces an evidence-lock phase before scoring and ranking. Compared to standard chain-of-thought prompting, this method markedly improves cue invariance, reducing rationalization bias by anchoring decisions to actual text evidence. The findings have major implications for automated evaluation pipelines in summarization and dialogue, where LLM judges are increasingly relied upon for benchmarks. The paper suggests that without such safeguards, judges may systematically prefer style over substance, undermining the validity of evaluation results.
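The evidence-lock, score, rank ordering can be sketched roughly as follows. This is an assumption-heavy illustration rather than the authors' implementation: the summary above does not say how evidence is locked, so the sketch treats "locking" as keeping only quotes that appear verbatim in the source. The two callables `extract_quotes` and `score_from_evidence` are hypothetical stand-ins for LLM judge calls, replaced here by toy functions so the example runs on its own.

```python
from typing import Callable, Dict, List, Tuple


def lock_evidence(source: str, quotes: List[str]) -> List[str]:
    """Keep only quotes found verbatim in the source (assumed meaning of 'lock')."""
    return [q for q in quotes if q and q in source]


def proof_before_preference(
    source: str,
    summaries: Dict[str, str],
    extract_quotes: Callable[[str, str], List[str]],   # hypothetical judge call
    score_from_evidence: Callable[[str, List[str]], float],  # hypothetical judge call
) -> Tuple[Dict[str, List[str]], Dict[str, float], List[str]]:
    # Phase 1: gather and lock evidence before any score is produced.
    locked = {
        name: lock_evidence(source, extract_quotes(source, text))
        for name, text in summaries.items()
    }
    # Phase 2: score each summary using only its locked evidence.
    scores = {name: score_from_evidence(summaries[name], ev) for name, ev in locked.items()}
    # Phase 3: rank by score, breaking ties by name for determinism.
    ranking = sorted(scores, key=lambda n: (-scores[n], n))
    return locked, scores, ranking


# Toy stand-ins for the judge: treat each sentence of the summary as a candidate quote,
# and score a summary by how many of its quotes survived the lock.
def toy_extract(source: str, summary: str) -> List[str]:
    return [s.strip() for s in summary.split(".") if s.strip()]


def toy_score(summary: str, evidence: List[str]) -> float:
    return len(evidence)


source_text = "The council approved the budget. The vote was 7 to 2."
candidates = {
    "A": "The council approved the budget.",
    "B": "Without any doubt, the council unanimously approved a historic budget.",
}
locked, scores, ranking = proof_before_preference(
    source_text, candidates, toy_extract, toy_score
)
print(ranking)
```

In this toy run, the confident-sounding summary B earns no locked evidence, so its tone cannot raise its score. The design point is that scoring never sees anything except text already verified against the source.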
- Introduces 5 cue interventions (Blind, Truth, Flip, Placebo, Reveal-After) to test judge cue invariance across 1,000 summaries.
- Finds substantial rationalization bias: judges anchor rankings on non-evidential cues like verbosity and confidence rather than underlying text.
- PROOF-BEFORE-PREFERENCE (evidence lock, score, rank) outperforms chain-of-thought prompting, improving cue invariance significantly.
Why It Matters
LLM judges can be swayed by spurious cues such as verbosity and confidence. This framework helps ensure that AI evaluations reflect actual content instead.