New technique cuts VLM hallucinations by 38% via redundancy amplification
Converting unique visual info into redundant data makes models far more robust
Vision-language models often hallucinate or break when one modality (e.g., the image) is ambiguous, corrupted, or missing. A new ICML 2026 paper by Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, and Roy Ka-Wei Lee tackles this by analyzing how information is shared across modalities. The authors categorize interactions as redundant (information present in both modalities), unique (exclusive to one modality), or synergistic (emerging only when both are combined). Their key insight: amplifying redundant interactions can compensate for an impaired modality, but most current training datasets strip away redundancy in favor of visual grounding.
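To make the three interaction types concrete, here is a toy sketch in Python. It is not the paper's estimator (this summary does not say how the authors measure interactions); it only uses plain mutual information on hand-built binary data, and the names `mutual_information` and `profile` are made up for illustration.

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_information(pairs):
    """I(A;B) in bits, treating the listed (a, b) samples as equally likely."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )

def profile(samples):
    """For (x1, x2, y) samples, return (I(X1;Y), I(X2;Y), I(X1,X2;Y))."""
    i1 = mutual_information([(x1, y) for x1, _, y in samples])
    i2 = mutual_information([(x2, y) for _, x2, y in samples])
    i12 = mutual_information([((x1, x2), y) for x1, x2, y in samples])
    return round(i1, 6), round(i2, 6), round(i12, 6)

bits = (0, 1)
# Redundant: both modalities carry the same copy of the label.
redundant = [(b, b, b) for b in bits]
# Unique: only modality 1 carries the label; modality 2 is noise.
unique = [(a, b, a) for a, b in product(bits, bits)]
# Synergistic: the label is the XOR of the two modalities.
synergistic = [(a, b, a ^ b) for a, b in product(bits, bits)]

for name, data in [("redundant", redundant), ("unique", unique), ("synergistic", synergistic)]:
    print(name, profile(data))
```

In the redundant case either modality alone recovers the full bit, so losing one costs nothing. In the unique and synergistic cases, removing the wrong modality (or either one, for XOR) leaves zero information about the label, which is the fragility the article describes.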
To restore this redundancy, the team builds a self-captioning workflow with a Multimodal Interaction Gate that converts unique modality-specific information into redundant representations. This lets the model rely on shared cues when the original visual input is degraded. In the reported experiments, visual-induced errors drop by 38.3% and cross-modal consistency rises by 16.8%. The approach is model-agnostic and can be applied to existing VLMs like CLIP or LLaVA, offering a practical path to more robust multimodal AI without requiring new architectures or massive retraining.
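The workflow described above can be sketched roughly as follows. This is a simplified illustration under loud assumptions, not the authors' implementation: the summary does not describe the gate's internals, so the gate here is a plain substring check, and `Example`, `interaction_gate`, `amplify_redundancy`, and `toy_captioner` are all hypothetical names. In a real system the captioner would be the VLM itself and the gate would presumably be learned.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    image_id: str
    text: str

def interaction_gate(caption_facts: List[str], text: str) -> List[str]:
    """Hypothetical gate: keep caption facts the text does not already mention.

    These are treated as 'unique' to the image. A substring test stands in
    for whatever criterion the real gate uses.
    """
    lowered = text.lower()
    return [fact for fact in caption_facts if fact.lower() not in lowered]

def amplify_redundancy(ex: Example, captioner: Callable[[str], List[str]]) -> Example:
    """Copy image-only facts into the text so both modalities carry them."""
    unique_facts = interaction_gate(captioner(ex.image_id), ex.text)
    if not unique_facts:
        return ex  # nothing image-only to add; text is already redundant
    extra = " Visual context: " + "; ".join(unique_facts) + "."
    return Example(ex.image_id, ex.text.rstrip() + extra)

def toy_captioner(image_id: str) -> List[str]:
    """Stand-in for self-captioning: returns canned facts per image id."""
    canned = {"img_001": ["red stop sign", "wet road"]}
    return canned.get(image_id, [])

ex = Example("img_001", "What should the driver do at the red stop sign?")
out = amplify_redundancy(ex, toy_captioner)
print(out.text)
```

Only "wet road" is appended, because the stop sign is already mentioned in the text. If the image is later corrupted or missing, the augmented text still carries that detail, which is the intuition behind turning unique information into redundant information.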
- Reduces visual-induced errors in vision-language models by 38.3%
- Improves cross-modal consistency by 16.8% using a novel Multimodal Interaction Gate
- Converts unique modality-specific info into redundant shared data via self-captioning without extra human annotation
Why It Matters
The approach could make vision-language models more reliable in real-world settings, where inputs are often corrupted or ambiguous.