Mitigating Multimodal LLM Hallucinations via Relevance Propagation at Inference Time
Training-free technique rebalances text and perceptual inputs at inference time
A persistent flaw in multimodal large language models (MLLMs) is their tendency to hallucinate: generating outputs that ignore or contradict the visual or audio input. This happens because textual tokens dominate during inference, causing the model to lean on language priors rather than grounded perceptual evidence. Researchers from Bar-Ilan University have introduced LIME (Learning Inference-time Modality Enhancement), a training-free framework that directly tackles this imbalance. LIME uses Layer-wise Relevance Propagation (LRP) to quantify how much each modality's tokens contribute to the model's output, then applies a relevance-based objective during decoding that encourages greater reliance on perceptual inputs. Crucially, this is achieved by updating the model's key-value representations at inference time, with no parameter changes or additional training data required.
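To make the attribution step concrete, here is a minimal sketch of per-modality relevance scoring. The paper uses full Layer-wise Relevance Propagation; this example substitutes gradient-times-input, a common first-order approximation of LRP, and all names (`modality_relevance`, `input_embeds`, `modality_mask`) are hypothetical rather than taken from the paper.

```python
import torch

def modality_relevance(logits, input_embeds, modality_mask):
    """Score how much each modality contributed to the next-token prediction.

    LIME uses full Layer-wise Relevance Propagation; this sketch substitutes
    gradient-times-input, a first-order approximation of LRP, and sums the
    resulting per-token relevance within each modality.

    Args:
        logits: (vocab_size,) next-token logits, still attached to the graph.
        input_embeds: (seq_len, hidden_dim) input token embeddings with
            requires_grad=True.
        modality_mask: (seq_len,) bool tensor; True marks perceptual
            (image/audio) tokens, False marks text tokens.
    Returns:
        (perceptual_relevance, text_relevance) as Python floats.
    """
    target = logits.max()  # relevance target: the top-scoring token's logit
    grads, = torch.autograd.grad(target, input_embeds, retain_graph=True)
    token_relevance = (grads * input_embeds).sum(dim=-1).abs()  # (seq_len,)
    perceptual = token_relevance[modality_mask].sum().item()
    text = token_relevance[~modality_mask].sum().item()
    return perceptual, text
```

In a LIME-style decoding loop, scores like these would drive the relevance-based objective that updates the key-value representations, as sketched after the bullet list below.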
LIME was evaluated across multiple multimodal benchmarks spanning both vision and audio domains. The results show consistent reductions in hallucination rates and improved grounding of generated content, all while maintaining or even slightly improving generation quality. The technique boosts the measured contribution of perceptual inputs and produces more localized, semantically aligned relevance patterns. Because LIME operates entirely during inference and requires no retraining, it offers a practical drop-in solution for existing MLLMs. This work addresses a critical reliability bottleneck, potentially making multimodal AI systems more trustworthy for applications like image captioning, video understanding, and audio transcription.
- LIME uses Layer-wise Relevance Propagation to quantify token contributions from each modality
- It modifies key-value representations at inference time without retraining or extra data (see the sketch after this list)
- Achieves consistent hallucination reduction across vision and audio benchmarks while maintaining output quality
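As a rough illustration of the inference-time intervention, the sketch below rescales the cached key/value states of perceptual tokens whenever the relevance scores indicate text is dominating. This is a simplified, hypothetical stand-in: LIME optimizes the key-value representations against its relevance-based objective rather than applying a fixed scale, and every name here (`rebalance_kv_cache`, the `past_key_values` layout, `step_size`) is an assumption.

```python
import torch

def rebalance_kv_cache(past_key_values, modality_mask,
                       perceptual_rel, text_rel, step_size=0.1):
    """Upweight cached keys/values of perceptual tokens when text dominates.

    Assumes a HuggingFace-style cache: a sequence of (key, value) pairs,
    each shaped (batch, heads, seq_len, head_dim). Scaling perceptual keys
    raises their attention scores at subsequent decoding steps. The real
    LIME optimizes the key-value representations against a relevance-based
    objective; the fixed scaling here is a simplified stand-in.
    """
    total = perceptual_rel + text_rel + 1e-8
    imbalance = max(0.0, text_rel / total - 0.5)  # > 0 iff text dominates
    scale = 1.0 + step_size * imbalance
    rebalanced = []
    for keys, values in past_key_values:
        k, v = keys.clone(), values.clone()
        k[:, :, modality_mask, :] *= scale  # boost perceptual keys
        v[:, :, modality_mask, :] *= scale  # and their values
        rebalanced.append((k, v))
    return rebalanced
```

A decoding loop would call this between generation steps, feeding it the per-modality relevance scores from the attribution sketch above.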
Why It Matters
A practical, retraining-free fix for one of the biggest reliability issues in multimodal AI systems