Mitigating Multimodal LLM Hallucinations via Relevance Propagation at Inference Time
Training-free technique rebalances text and perceptual inputs at inference time
A persistent flaw in multimodal large language models (MLLMs) is their tendency to hallucinate: generating outputs that ignore or contradict the visual or audio input. This happens because textual tokens dominate during inference, causing the model to lean on language priors rather than grounded perceptual evidence. Researchers from Bar-Ilan University have introduced LIME (Learning Inference-time Modality Enhancement), a training-free framework that directly tackles this imbalance. LIME uses Layer-wise Relevance Propagation (LRP) to quantify how much each modality's tokens contribute to the model's output, then applies a relevance-based objective during decoding that encourages greater reliance on perceptual inputs. Crucially, this is achieved by updating the model's key-value representations at inference time, with no parameter changes or additional training data required.
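To make the attribution step concrete, here is a minimal sketch of per-modality relevance scoring. The paper uses full Layer-wise Relevance Propagation; this example substitutes gradient-times-input, a common first-order approximation of LRP, and all names (`modality_relevance`, `input_embeds`, `modality_mask`) are hypothetical rather than taken from the paper.

```python
import torch

def modality_relevance(logits, input_embeds, modality_mask):
    """Score how much each modality contributed to the next-token prediction.

    LIME uses full Layer-wise Relevance Propagation; this sketch substitutes
    gradient-times-input, a first-order approximation of LRP, and sums the
    resulting per-token relevance within each modality.

    Args:
        logits: (vocab_size,) next-token logits, still attached to the graph.
        input_embeds: (seq_len, hidden_dim) input token embeddings with
            requires_grad=True.
        modality_mask: (seq_len,) bool tensor; True marks perceptual
            (image/audio) tokens, False marks text tokens.
    Returns:
        (perceptual_relevance, text_relevance) as Python floats.
    """
    target = logits.max()  # relevance target: the top-scoring token's logit
    grads, = torch.autograd.grad(target, input_embeds, retain_graph=True)
    token_relevance = (grads * input_embeds).sum(dim=-1).abs()  # (seq_len,)
    perceptual = token_relevance[modality_mask].sum().item()
    text = token_relevance[~modality_mask].sum().item()
    return perceptual, text
```

In a LIME-style decoding loop, scores like these would drive the relevance-based objective that updates the key-value representations, as sketched after the bullet list below.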
LIME was evaluated across multiple multimodal benchmarks spanning both vision and audio domains. The results show consistent reductions in hallucination rates and improved grounding of generated content, all while maintaining or even slightly improving generation quality. The technique boosts the measured contribution of perceptual inputs and produces more localized, semantically aligned relevance patterns. Because LIME operates entirely during inference and requires no retraining, it offers a practical drop-in solution for existing MLLMs. This work addresses a critical reliability bottleneck, potentially making multimodal AI systems more trustworthy for applications like image captioning, video understanding, and audio transcription.
- LIME uses Layer-wise Relevance Propagation to quantify token contributions from each modality
- It modifies key-value representations at inference time without retraining or extra data (see the sketch after this list)
- Achieves consistent hallucination reduction across vision and audio benchmarks while maintaining output quality
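As a rough illustration of the inference-time intervention, the sketch below rescales the cached key/value states of perceptual tokens whenever the relevance scores indicate text is dominating. This is a simplified, hypothetical stand-in: LIME optimizes the key-value representations against its relevance-based objective rather than applying a fixed scale, and every name here (`rebalance_kv_cache`, the `past_key_values` layout, `step_size`) is an assumption.

```python
import torch

def rebalance_kv_cache(past_key_values, modality_mask,
                       perceptual_rel, text_rel, step_size=0.1):
    """Upweight cached keys/values of perceptual tokens when text dominates.

    Assumes a HuggingFace-style cache: a sequence of (key, value) pairs,
    each shaped (batch, heads, seq_len, head_dim). Scaling perceptual keys
    raises their attention scores at subsequent decoding steps. The real
    LIME optimizes the key-value representations against a relevance-based
    objective; the fixed scaling here is a simplified stand-in.
    """
    total = perceptual_rel + text_rel + 1e-8
    imbalance = max(0.0, text_rel / total - 0.5)  # > 0 iff text dominates
    scale = 1.0 + step_size * imbalance
    rebalanced = []
    for keys, values in past_key_values:
        k, v = keys.clone(), values.clone()
        k[:, :, modality_mask, :] *= scale  # boost perceptual keys
        v[:, :, modality_mask, :] *= scale  # and their values
        rebalanced.append((k, v))
    return rebalanced
```

A decoding loop would call this between generation steps, feeding it the per-modality relevance scores from the attribution sketch above.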
Why It Matters
A practical, retraining-free fix for one of the biggest reliability issues in multimodal AI systems