LIM is a lightweight cross-attention module that generates imaginary visual embeddings from text-only inputs.

KAIST's LIM module fixes VLM accuracy when images are missing

arXiv cs.CL May 14, 2026

⚡ Vision-language models lose accuracy and calibration when images are missing; a new module restores both.

Deep Dive

Vision-Language Models (VLMs) are typically trained on paired image-text data, but real-world deployment often forces them to operate on text-only inputs. Researchers from KAIST found that removing the vision modality causes large accuracy drops and severe miscalibration: the model no longer behaves like its original language backbone. Even when text descriptions preserve the semantic content of the missing image, the model's confidence becomes unreliable. This "missing-modality gap" undermines trust in VLM outputs.
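Calibration error is commonly quantified with expected calibration error (ECE). The summary does not say which metric the paper reports, so the following is only a minimal sketch of the standard equal-width-bin ECE. The function name, the bin count, and the sample values are illustrative assumptions, not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: the sample-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct:     booleans, True where the chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical example: a model that is 90% confident but right only 3 times out of 4.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(overconfident, 4))
```

A lower value means confidence tracks accuracy more closely; a well-calibrated model that is 90% confident should be right about 90% of the time.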

To close this gap, the team proposes the Latent Imagination Module (LIM), a lightweight cross-attention module that predicts imagined latent embeddings from textual input alone. These embeddings are fed into a frozen VLM backbone, so no pixel-level image synthesis is required. Across text-only benchmarks, unseen tasks, and missing-image scenarios, LIM significantly improves accuracy and reduces calibration error. The paper (9 pages, 16 figures) will be presented at the ICLR 2026 Workshop on Principled Design for Trustworthy AI.
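To make the idea concrete, here is a rough NumPy sketch of how a cross-attention module could turn text token embeddings into a fixed number of "imagined" latent embeddings and place them in front of the text sequence. Everything specific here is an assumption for illustration: the class name, the use of learned query vectors, the single attention head, the random untrained weights, and the choice to prepend the latents to the text embeddings. The summary does not describe the paper's actual architecture at this level of detail.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class LatentImaginationSketch:
    """Hypothetical single-head cross-attention: queries attend over text tokens."""

    def __init__(self, d_model, n_latents, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(d_model)
        self.d_model = d_model
        # In a trained module these would be learned; here they are random stand-ins.
        self.queries = rng.normal(size=(n_latents, d_model)) * scale
        self.w_k = rng.normal(size=(d_model, d_model)) * scale
        self.w_v = rng.normal(size=(d_model, d_model)) * scale
        self.w_o = rng.normal(size=(d_model, d_model)) * scale

    def attention_weights(self, text_emb):
        """Return a (n_latents, n_tokens) matrix; each row sums to 1."""
        keys = text_emb @ self.w_k
        scores = self.queries @ keys.T / np.sqrt(self.d_model)
        return softmax(scores, axis=-1)

    def __call__(self, text_emb):
        """Map (n_tokens, d_model) text embeddings to (n_latents, d_model) latents."""
        values = text_emb @ self.w_v
        return self.attention_weights(text_emb) @ values @ self.w_o


def build_backbone_input(imagined, text_emb):
    """Prepend imagined latents where image embeddings would normally sit (assumed layout)."""
    return np.concatenate([imagined, text_emb], axis=0)


# Toy input: 7 text tokens with 16-dimensional embeddings (made-up sizes).
text_emb = np.random.default_rng(1).normal(size=(7, 16))
lim = LatentImaginationSketch(d_model=16, n_latents=4)
imagined = lim(text_emb)
sequence = build_backbone_input(imagined, text_emb)
print(imagined.shape, sequence.shape)
```

In this sketch only the small module would carry trainable parameters, while the backbone that consumes `sequence` stays frozen, which is consistent with the "lightweight" and "frozen backbone" description above.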

Key Points

LIM is a lightweight cross-attention module that generates imaginary visual embeddings from text-only inputs.
Outperforms baselines on text-only benchmarks, unseen tasks, and missing-image scenarios, improving both accuracy and calibration.
Accepted at the ICLR 2026 Workshop on Principled Design for Trustworthy AI; the authors are researchers at KAIST.

Why It Matters

Enables reliable VLM inference without images, critical for text-only deployments in production systems.
