SURE-RAG: Sufficiency and Uncertainty-Aware Evidence Verification for Selective Retrieval-Augmented Generation
A new framework lets RAG systems answer only when the retrieved evidence actually supports the candidate answer.
Retrieval-augmented generation (RAG) grounds AI answers in external passages, but retrieval alone doesn't verify whether the evidence truly supports a given answer. Current systems often proceed with topical-but-insufficient passages, leading to incorrect or unsafe responses. SURE-RAG tackles this gap by framing it as evidence sufficiency verification: given a question, candidate answer, and retrieved evidence, it predicts support, refutation, or insufficiency, and abstains unless support is confirmed. The protocol aggregates pair-level claim-evidence signals into interpretable answer-level scores including coverage, relation strength, disagreement, conflict, and retrieval uncertainty.
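To make the aggregate-then-abstain idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `PairSignal` schema, the formula behind each answer-level score, and every threshold are assumptions chosen for illustration, since this summary names the signals (coverage, relation strength, disagreement, conflict, retrieval uncertainty) but not how they are computed.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class PairSignal:
    """One claim-evidence pair scored by some verifier (hypothetical schema)."""
    claim_id: int           # which atomic claim of the answer this pair concerns
    support: float          # P(passage supports the claim), in [0, 1]
    refute: float           # P(passage refutes the claim), in [0, 1]
    retrieval_score: float  # retriever confidence for the passage, in [0, 1]


def aggregate(pairs, n_claims, support_threshold=0.5):
    """Collapse pair-level signals into answer-level scores (assumed formulas)."""
    if not pairs or n_claims == 0:
        # No evidence at all: nothing is covered and uncertainty is maximal.
        return {"coverage": 0.0, "relation_strength": 0.0, "disagreement": 0.0,
                "conflict": 0.0, "retrieval_uncertainty": 1.0}
    # Strongest supporting passage found for each claim.
    best = {}
    for p in pairs:
        best[p.claim_id] = max(best.get(p.claim_id, 0.0), p.support)
    per_claim = [best.get(c, 0.0) for c in range(n_claims)]
    return {
        # Fraction of claims with at least one sufficiently supporting passage.
        "coverage": sum(s >= support_threshold for s in per_claim) / n_claims,
        # Average of the best support score per claim.
        "relation_strength": mean(per_claim),
        # Spread of support scores across all pairs.
        "disagreement": pstdev([p.support for p in pairs]),
        # Strongest refutation signal from any passage.
        "conflict": max(p.refute for p in pairs),
        # One minus the average retriever confidence.
        "retrieval_uncertainty": 1.0 - mean([p.retrieval_score for p in pairs]),
    }


def decide(scores, min_coverage=1.0, min_strength=0.7,
           max_conflict=0.5, max_uncertainty=0.5):
    """Three-way verdict; all thresholds are illustrative, not from the paper."""
    if scores["conflict"] >= max_conflict:
        return "refuted"
    if (scores["coverage"] >= min_coverage
            and scores["relation_strength"] >= min_strength
            and scores["retrieval_uncertainty"] <= max_uncertainty):
        return "supported"
    return "insufficient"


def should_answer(scores):
    """Selective answering: abstain unless the verdict is 'supported'."""
    return decide(scores) == "supported"
```

Because every intermediate score is a named number rather than a hidden activation, a reviewer can see why an answer was withheld, for example because one claim had no supporting passage or because a passage contradicted it.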
On the controlled multi-hop benchmark HotpotQA-RAG v3, calibrated SURE-RAG achieves 0.9075 Macro-F1 (0.8951±0.0069), substantially above DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284), while matching an opaque cross-encoder (0.8888±0.0109) and remaining fully auditable. Risk at 30% coverage drops from 0.2588 to 0.1642, a relative reduction of roughly 37% in unsafe answers. Crucially, the paper also shows that sufficiency verification and hallucination detection are distinct problems: on HaluBench the ranking reverses, with SURE-RAG trailing the GPT-4o judge (unsafe-F1 0.3343 vs 0.7389), which highlights the need for task-specific evaluation. This work provides a practical, transparent mechanism for selective RAG answering.
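For readers unfamiliar with the metric, risk at a given coverage can be sketched as follows. This uses the standard selective-prediction definition (answer only the most confident fraction of questions, then measure the error rate among those answered), which is assumed here to match the paper's usage. The function names are hypothetical.

```python
def risk_at_coverage(confidences, is_correct, coverage):
    """Error rate among the `coverage` fraction of most-confident answers."""
    if not confidences or len(confidences) != len(is_correct):
        raise ValueError("need two non-empty lists of equal length")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    n = len(confidences)
    k = max(1, round(coverage * n))  # number of questions actually answered
    # Rank by confidence, highest first, and keep only the answered subset.
    ranked = sorted(zip(confidences, is_correct),
                    key=lambda pair: pair[0], reverse=True)
    errors = sum(1 for _, ok in ranked[:k] if not ok)
    return errors / k


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# The reported risks at 30% coverage give (0.2588 - 0.1642) / 0.2588,
# about 36.6%, which the summary rounds to 37%.
print(f"{relative_reduction(0.2588, 0.1642):.1%}")
```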
- SURE-RAG achieves 0.9075 Macro-F1 on HotpotQA-RAG v3, outperforming DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284)
- Reduces risk of unsafe answers by 37% at 30% coverage (from 0.2588 to 0.1642)
- Provides full auditability through interpretable signals like coverage, conflict, and uncertainty, unlike opaque cross-encoders
Why It Matters
Makes RAG systems more reliable by answering only when the evidence is sufficient, cutting unsafe responses by over a third at 30% coverage on the paper's benchmark.