CAS Researchers: LLM Rerankers Self-Assess Ranking Quality via Self-Consistency
Self-consistency matches or beats SOTA QPP methods; verbalized confidence from LLM rerankers is severely overconfident.
A new paper from CAS explores whether LLM rerankers can predict their own ranking performance without external predictors. The researchers test two training-free approaches: self-consistency across sampled rankings and direct verbalized confidence from the model. Experiments on TREC Deep Learning datasets (2019–2022) using four LLMs show that self-consistency is competitive with or better than current state-of-the-art query performance prediction (QPP) methods and is better calibrated in almost all settings. In contrast, verbalized confidence is severely overconfident, making it unreliable for real-world use.
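As a rough illustration of the self-consistency idea, the sketch below scores a query by how much several sampled rankings agree with one another. It assumes mean pairwise Kendall tau as the agreement measure. The summary does not say which measure the paper uses, so the function names and the choice of tau are hypothetical.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau between two rankings of the same document IDs (no ties)."""
    pos_b = {doc: i for i, doc in enumerate(rank_b)}
    concordant = discordant = 0
    # Every pair (x, y) here has x ranked above y in rank_a.
    for x, y in combinations(rank_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0


def self_consistency(sampled_rankings):
    """Mean pairwise Kendall tau over rankings sampled for one query.

    Assumption: higher agreement is read as higher predicted ranking quality.
    """
    pairs = list(combinations(sampled_rankings, 2))
    if not pairs:
        raise ValueError("need at least two sampled rankings")
    return sum(kendall_tau(a, b) for a, b in pairs) / len(pairs)


# Hypothetical samples: the same reranker run three times on one query.
samples = [
    ["d1", "d2", "d3"],
    ["d1", "d2", "d3"],
    ["d1", "d3", "d2"],
]
score = self_consistency(samples)
```

A score near 1 means the sampled rankings largely agree. A score near 0 or below suggests the reranker is unstable on that query, which this sketch treats as a sign of lower expected quality.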
To improve verbalized confidence, the authors propose two lightweight supervised approaches: Verb-Num and Verb-List. These methods require only a few additional output tokens and produce calibrated ranking-quality estimates without heavy retraining. The work matters because it lets LLM rerankers audit their own output quality, reducing reliance on external evaluation pipelines. For practitioners building retrieval systems, this could mean simpler, more reliable quality monitoring in production RAG pipelines.
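To make "overconfident" and "calibrated" concrete, here is a minimal sketch of a generic calibration check. It compares a reranker's stated confidence with the measured quality of its rankings. It assumes both values lie in [0, 1], for example a verbalized confidence and a metric such as NDCG@10. The binned error is a standard ECE-style measure and is not necessarily the metric used in the paper, and the function name is hypothetical.

```python
def calibration_report(predicted, actual, n_bins=5):
    """Return (mean_gap, binned_error) for confidence vs. measured quality.

    mean_gap > 0 means the model is overconfident on average.
    binned_error is an ECE-style weighted gap across confidence bins.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("predicted and actual must be non-empty and equal length")
    n = len(predicted)
    mean_gap = sum(predicted) / n - sum(actual) / n

    # Group queries by predicted confidence, then compare averages per bin.
    bins = [[] for _ in range(n_bins)]
    for p, a in zip(predicted, actual):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, a))

    binned_error = 0.0
    for group in bins:
        if group:
            avg_p = sum(p for p, _ in group) / len(group)
            avg_a = sum(a for _, a in group) / len(group)
            binned_error += len(group) / n * abs(avg_p - avg_a)
    return mean_gap, binned_error


# Hypothetical numbers: the model always claims 0.9, but quality averages 0.5.
confidence = [0.9, 0.9, 0.9, 0.9]
quality = [0.5, 0.4, 0.6, 0.5]
gap, err = calibration_report(confidence, quality)
```

A large positive gap, as in this made-up example, is the overconfidence pattern described above. A calibrated estimator would bring both numbers close to zero.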
- Self-consistency across sampled rankings matches or exceeds external SOTA QPP methods on TREC DL 2019–2022.
- Direct verbalized confidence from LLM rerankers is severely overconfident in all four tested models.
- Verb-Num and Verb-List calibrate confidence with just a few extra tokens, enabling lightweight self-assessment.
Why It Matters
Letting LLM rerankers audit their own ranking quality reduces the need for external evaluation in production retrieval systems.