LeMUQ boosts multimodal RAG reliability with 3.8% AUROC gain
New method quantifies uncertainty across text, images, and retrieval steps.
Retrieval-Augmented Generation (RAG) extends large language models by pulling in external knowledge, and multimodal RAG adds images to the mix. But answers can still be wrong, and existing uncertainty quantification (UQ) methods, which were designed for text-only models, fall short in these richer scenarios. The key problem: uncertainty can come from retrieval, visual understanding, and generation, and these sources interact.
In a new arXiv paper, researchers present LeMUQ, a Learnable Multimodal UQ method that captures this complexity. It works by probing token-level probabilities under input modifications, such as removing the image or dropping retrieved documents. These signals are encoded as special probability tokens and fed into a small finetuned model that learns to combine them. Across experiments with multiple datasets, retrievers, and vision-language models (VLMs), LeMUQ raises the AUROC (area under the receiver operating characteristic curve) by an average of 3.8% over baselines, including finetuned UQ approaches. The method also generalizes well to different retrievers and datasets, though transferring across VLMs gives mixed results. Code is available on GitHub, and the authors highlight this as a step toward safer, more reliable multimodal RAG systems.
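The probing step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Scorer` signature, the `probe_features` helper, and the `toy_scorer` stand-in are all hypothetical names, and the sketch stops at raw per-token differences rather than encoding them as probability tokens for a finetuned model.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical scorer signature (an assumption, not the paper's API):
# given answer tokens, an optional image, and retrieved documents,
# return one log-probability per answer token.
Scorer = Callable[[Sequence[str], Optional[str], Sequence[str]], List[float]]


def probe_features(
    answer_tokens: Sequence[str],
    image: Optional[str],
    docs: Sequence[str],
    scorer: Scorer,
) -> Dict[str, List[float]]:
    """Score the same answer under the full input and two ablated inputs."""
    full = scorer(answer_tokens, image, docs)
    no_image = scorer(answer_tokens, None, docs)  # image removed
    no_docs = scorer(answer_tokens, image, [])    # retrieved documents dropped
    return {
        "full": full,
        # How much each token's log-probability depends on the image.
        "drop_image": [f - a for f, a in zip(full, no_image)],
        # How much each token's log-probability depends on retrieval.
        "drop_docs": [f - a for f, a in zip(full, no_docs)],
    }


def toy_scorer(
    answer_tokens: Sequence[str],
    image: Optional[str],
    docs: Sequence[str],
) -> List[float]:
    """Stand-in for a real VLM: each available input raises token log-probs."""
    base = -2.0
    bonus = (0.5 if image is not None else 0.0) + (1.0 if docs else 0.0)
    return [base + bonus for _ in answer_tokens]


features = probe_features(["Paris", "."], "photo.png", ["doc about France"], toy_scorer)
print(features)
```

In a real setup, `toy_scorer` would be replaced by a function that runs the VLM with teacher forcing on the generated answer; the resulting per-token signals are the kind of input a learned combiner could consume.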
- LeMUQ improves uncertainty estimation by 3.8% average AUROC across multiple datasets, retrievers, and VLMs.
- It analyzes token probability changes when removing modalities (e.g., text or images) or retrieved context, then learns to combine these signals.
- Strong generalization across retrieval setups and datasets, but mixed results when transferring between different vision-language models.
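For readers unfamiliar with the metric, the sketch below shows how AUROC can be computed for an uncertainty estimator: it is the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen incorrect one. The function name `auroc` and the example scores are illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: labels are 1 for a correct answer, 0 for an incorrect one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correct answer ranked above incorrect one
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up confidence scores for four answers, two correct and two incorrect.
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation of correct from incorrect answers.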
Why It Matters
Better reliability for multimodal AI systems that combine text, images, and external knowledge.