KAIST's LIM restores VLM accuracy when images are missing
Vision-language models fail without images; a new module restores accuracy and calibration.
Vision-language models (VLMs) are typically trained on paired image-text data, but real-world deployment often forces them to operate on text-only inputs. Researchers from KAIST found that removing the vision modality causes large accuracy drops and severe miscalibration: the model no longer behaves like its original language backbone. Even when text descriptions preserve the semantic content of the missing image, the model's confidence becomes unreliable. This "missing-modality gap" undermines trust in VLM outputs.
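Miscalibration of this kind is commonly measured with expected calibration error (ECE), which compares how confident a model is with how often it is actually right. The summary here does not say which calibration metric the paper uses, so the following is only a minimal sketch of the standard equal-width binned ECE; the function name and the toy inputs are illustrative assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the sample-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen answer, each in [0, 1].
    correct: booleans, True where the chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data (made up): the model claims 90% confidence but is right 3 times out of 4.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A well-calibrated model has an ECE near zero; the overconfident toy model above is off by roughly the difference between its stated confidence (0.9) and its accuracy (0.75).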
To close this gap, the team proposes the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input alone and feeds them into a frozen VLM backbone, with no pixel-level image synthesis required. Across text-only benchmarks, unseen tasks, and missing-image scenarios, the authors report that LIM significantly improves accuracy and reduces calibration error. The paper (9 pages, 16 figures) will be presented at the ICLR 2026 Workshop on Principled Design for Trustworthy AI.
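The idea of predicting latent embeddings from text with cross-attention can be sketched in a few lines. This is not the authors' implementation: the summary gives no architectural details, so the learned-query design, the single attention head, the dimensions, the random weights, and every name below (`LatentImaginationSketch`, `num_latents`, `backbone_input`) are assumptions made purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LatentImaginationSketch:
    """Hypothetical single-head cross-attention: queries attend over text tokens."""

    def __init__(self, num_latents, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One query per imagined "visual" slot (an assumed design choice).
        self.queries = random_matrix(num_latents, dim, rng)
        self.w_key = random_matrix(dim, dim, rng)
        self.w_value = random_matrix(dim, dim, rng)

    def __call__(self, text_embeddings):
        keys = [matvec(self.w_key, token) for token in text_embeddings]
        values = [matvec(self.w_value, token) for token in text_embeddings]
        imagined = []
        for query in self.queries:
            # Scaled dot-product attention of one query over all text tokens.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(self.dim)
                for key in keys
            ]
            weights = softmax(scores)
            imagined.append(
                [sum(w * value[j] for w, value in zip(weights, values))
                 for j in range(self.dim)]
            )
        return imagined


rng = random.Random(1)
text_embeddings = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]  # 3 toy tokens
module = LatentImaginationSketch(num_latents=4, dim=8)
imagined = module(text_embeddings)
# The imagined latents take the place of image embeddings in the backbone's input.
backbone_input = imagined + text_embeddings
```

In a setup like the one described, only the small module's parameters would be trained while the VLM backbone stays frozen, which is what keeps the approach lightweight; the sketch above performs the forward pass only and does no training.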
- LIM is a lightweight cross-attention module that generates imagined visual embeddings from text-only inputs.
- It improves accuracy and calibration over baselines on text-only benchmarks, unseen tasks, and missing-image scenarios.
- The work, by KAIST researchers, was accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI.
Why It Matters
LIM makes VLM inference more reliable when images are unavailable, which is critical for text-only deployments in production systems.