MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
New research framework achieves 37.9 nDCG@10 on MM-BRIGHT, beating top vision-language encoders by over 10 points.
A research team led by Mahmoud SalahEldin Kasem has published a paper introducing MARVEL (Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL), a novel AI pipeline designed to solve the persistent challenge of retrieving relevant text documents based on complex, multimodal queries that combine images and text. The core problem is that even the best vision-language encoders achieve only a 27.6 nDCG@10 score on the reasoning-intensive MM-BRIGHT benchmark, underperforming text-only systems. The researchers argue that effective multimodal search requires three tightly integrated capabilities that current approaches handle in isolation: expanding a query's latent intent, retrieving with a model trained for complex reasoning, and reranking results via explicit step-by-step reasoning.
MARVEL's architecture unifies these three stages. First, it uses an LLM to expand the user's query. Second, it employs a custom fine-tuned dense retriever called MARVEL-Retriever, which is specifically trained to handle complex multimodal reasoning. Finally, it reranks the candidate documents using GPT-4o to perform chain-of-thought reasoning, with an optional multi-pass reciprocal rank fusion step for further refinement. Evaluated across 29 diverse technical domains within the MM-BRIGHT benchmark, MARVEL achieved a state-of-the-art score of 37.9 nDCG@10. This represents a significant +10.3 point improvement over the previous best multimodal encoder; MARVEL outperformed all single-stage baselines in 27 of the 29 domains, matching rather than beating them only in two highly specialized fields, Cryptography and Quantum Computing. The results strongly support the team's hypothesis that a unified expand-retrieve-rerank framework is superior for reasoning-intensive multimodal information retrieval tasks.
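The article does not give the details of MARVEL's multi-pass fusion step, but reciprocal rank fusion itself is a standard, well-defined algorithm: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 the conventional smoothing constant. A minimal illustrative sketch (the function name, document IDs, and two-pass setup here are hypothetical, not from the paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs.

    Each document's score is the sum of 1 / (k + rank) across
    all lists it appears in; higher fused score ranks first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort document IDs by descending fused score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: fuse two reranking passes over one candidate pool
pass_a = ["doc3", "doc1", "doc2"]
pass_b = ["doc3", "doc1", "doc4"]
fused = reciprocal_rank_fusion([pass_a, pass_b])
# doc3 tops both passes, so it tops the fused list as well
```

Because RRF uses only ranks, not raw scores, it can combine reranking passes whose score scales are not comparable, which is presumably why it suits a multi-pass LLM reranker.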
- Achieves 37.9 nDCG@10 on MM-BRIGHT benchmark, a +10.3 point gain over the previous best vision-language encoder.
- Unifies three stages: LLM query expansion, a reasoning-enhanced dense retriever (MARVEL-Retriever), and GPT-4o-based chain-of-thought reranking.
- Outperforms single-stage baselines in 27 of 29 technical domains, demonstrating the effectiveness of its integrated pipeline design.
Why It Matters
This framework significantly improves AI's ability to search technical documents using complex image-and-text queries, advancing multimodal RAG systems.