Research & Papers

LITTA: Late-Interaction and Test-Time Alignment for Visually-Grounded Multimodal Retrieval

New framework improves retrieval accuracy for complex documents without retraining models, using LLM-generated query variants.

Deep Dive

Seonok Kim's new research paper introduces LITTA (Late-Interaction and Test-Time Alignment), a framework designed to solve a persistent problem in AI: retrieving the right pages from complex, visually rich documents. Technical manuals, textbooks, and scientific reports present unique challenges with their long contexts, intricate layouts, and the fact that a user's question might not use the same words as the supporting evidence. LITTA tackles this by employing a simple yet powerful query-expansion strategy at test time, meaning it works with existing systems without costly retraining.

Given a single user query, LITTA uses a large language model (LLM) to generate several complementary phrasings of the same question. Each variant is then sent through a pre-trained, 'frozen' multimodal retriever—a model that understands both text and images—which scores candidate pages. The key innovation is the aggregation step: results from all query variants are combined using reciprocal rank fusion, which improves evidence coverage and makes the system less sensitive to how the initial question was worded. This approach proved particularly effective in domains with high visual and semantic variability.
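The fusion step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each query variant's retriever output is simply a ranked list of page IDs, and uses the conventional RRF constant k = 60; the page IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of page IDs into one ranking.

    Each page's fused score is the sum of 1 / (k + rank) over every
    list it appears in, so pages ranked highly by many query variants
    rise to the top. k=60 is the commonly used RRF constant.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] += 1.0 / (k + rank)
    # Sort pages by descending fused score
    return sorted(scores, key=scores.get, reverse=True)

# Rankings produced by three LLM-generated variants of one question
variant_rankings = [
    ["p3", "p1", "p7"],
    ["p1", "p3", "p2"],
    ["p1", "p7", "p3"],
]
fused = reciprocal_rank_fusion(variant_rankings)
# "p1" wins: two variants rank it first, one ranks it second
```

Because the fusion only needs rank positions, not raw scores, it composes cleanly with any frozen retriever, which is what makes the approach retraining-free.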

The evaluation across three challenging domains—computer science, pharmaceuticals, and industrial manuals—showed consistent improvements in key metrics like top-k accuracy and Mean Reciprocal Rank (MRR) compared to standard single-query retrieval. A major practical advantage is the controllable accuracy-efficiency trade-off: developers can adjust the number of query variants to fit their latency budget. This makes LITTA a deployable way to harden real-world applications like enterprise knowledge bases and technical support systems, and it demonstrates that smarter query strategies alone can significantly boost multimodal retrieval performance.
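For readers unfamiliar with the reported metrics, here is a hedged sketch of how top-k accuracy and MRR are typically computed from per-query rankings. The paper's exact evaluation protocol is not reproduced here, and the rankings and gold labels below are made up for illustration.

```python
def top_k_accuracy(rankings, gold, k=5):
    """Fraction of queries whose gold page appears in the top k results."""
    hits = sum(1 for r, g in zip(rankings, gold) if g in r[:k])
    return hits / len(gold)

def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold page; contributes 0 when it is missed."""
    total = 0.0
    for r, g in zip(rankings, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)
    return total / len(gold)

# Hypothetical rankings for three queries and their gold (correct) pages
rankings = [["p2", "p5", "p1"], ["p4", "p2"], ["p9", "p3", "p8"]]
gold = ["p5", "p7", "p9"]
acc = top_k_accuracy(rankings, gold, k=2)   # p5 and p9 hit, p7 missed -> 2/3
mrr = mean_reciprocal_rank(rankings, gold)  # (1/2 + 0 + 1) / 3 = 0.5
```

Both metrics reward pushing the correct page toward rank 1, which is why rank-based fusion across query variants shows up directly in these numbers.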

Key Points
  • Uses LLM query expansion to generate multiple phrasings from a single user question, improving evidence coverage.
  • Aggregates results using reciprocal rank fusion on a frozen vision retriever, requiring no model retraining.
  • Showed consistent accuracy gains across computer science, pharmaceutical, and industrial manual domains with controllable latency.

Why It Matters

Enables more accurate AI assistants for technical documentation, manuals, and scientific reports without expensive system retraining.