ReCQR: Incorporating conversational query rewriting to improve Multimodal Image Retrieval
New technique rewrites messy, multi-turn user queries into clean prompts, improving AI's ability to find the right image.
A team of researchers has published a paper on ReCQR, a method that addresses a core weakness in modern image search: vague, conversational user queries. Existing multimodal retrieval systems often struggle when users ask long, ambiguous questions or refer back to earlier parts of a dialogue. ReCQR tackles this by introducing a conversational query rewriting (CQR) task, in which a model reads the full chat history together with the user's final, underspecified request and rewrites that request into a clear, standalone query optimized for retrieval.
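To make the task's input and output concrete, here is a minimal sketch of how a dialogue and its final request might be packaged into a rewriting instruction for a language model. The `Turn` class, the `build_rewrite_prompt` function, and the prompt wording are all hypothetical illustrations, not the paper's actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # "user" or "assistant" (hypothetical role labels)
    text: str


def build_rewrite_prompt(history: List[Turn], final_query: str) -> str:
    """Assemble an instruction asking a model for a standalone query.

    Assumption: the wording below is illustrative only; the source does
    not describe the prompt the authors used.
    """
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in history)
    return (
        "Rewrite the final user request as one self-contained image search query.\n"
        "Resolve pronouns and references using the conversation.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Final request: {final_query}\n"
        "Standalone query:"
    )


# Invented example dialogue, used only to exercise the function.
history = [
    Turn("user", "Show me a red vintage bicycle."),
    Turn("assistant", "Here is a red vintage bicycle leaning on a wall."),
]
prompt = build_rewrite_prompt(history, "Can I see one like that but in blue, by the sea?")
print(prompt)
```

A reasonable rewrite for this dialogue would be something like "a blue vintage bicycle by the sea", which a retrieval model can match without access to the chat history.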
The researchers first built a dedicated dataset to train and test this approach. They leveraged Large Language Models (LLMs) to generate candidate rewritten queries at scale, then used an 'LLM-as-Judge' mechanism combined with manual review to curate approximately 7,000 high-quality, multi-turn multimodal dialogues. This dataset, named the ReCQR dataset, provides a new benchmark for the field.
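The curation flow described above can be sketched as a generate, judge, then review pipeline. This is a rough illustration under stated assumptions: the source does not say how the judge and the manual review are combined, so the threshold, the routing of survivors to a human review queue, and the `toy_generate` and `toy_judge` stand-ins are all hypothetical.

```python
from typing import Callable, List, Tuple


def build_review_queue(
    dialogue: str,
    generate: Callable[[str], List[str]],
    judge: Callable[[str, str], float],
    threshold: float = 0.8,  # assumed cutoff, not taken from the paper
) -> List[Tuple[str, float]]:
    """Return judge-approved candidates, best first, for manual review."""
    scored = [(c, judge(dialogue, c)) for c in generate(dialogue)]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def toy_generate(dialogue: str) -> List[str]:
    # Stand-in for an LLM call that proposes candidate rewrites.
    return ["a blue vintage bicycle by the sea", "one like that but in blue"]


def toy_judge(dialogue: str, candidate: str) -> float:
    # Stand-in for an LLM-as-Judge call: rejects candidates that still
    # contain words which depend on the chat history.
    unresolved = {"that", "it", "one", "this"}
    words = candidate.lower().split()
    return 0.0 if any(w in unresolved for w in words) else 1.0


queue = build_review_queue("(dialogue text)", toy_generate, toy_judge)
print(queue)
```

Passing the generator and judge in as callables keeps the pipeline logic separate from any particular model, so either stand-in could be swapped for a real model call.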
Experiments benchmarking state-of-the-art multimodal models on the new dataset show that adding the CQR step improves retrieval accuracy. By clarifying user intent and distilling a complex dialogue into a concise query, the rewriting step helps existing image retrieval models find the correct image. The work provides a new framework for modeling real-world user interactions in AI systems, moving beyond single, perfectly phrased queries to handle the messy reality of human conversation.
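One common way to measure this kind of improvement is Recall@K, the fraction of queries whose target image appears among the top K retrieved results. The source does not name the metric the authors report, so the sketch below is an assumption about how such a comparison could be scored. The query IDs, image IDs, and rankings are invented placeholders that only exercise the function; they are not results from the paper.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], targets: Dict[str, str], k: int) -> float:
    """Fraction of queries whose target image is in the top-k results."""
    hits = sum(1 for qid, ranked in rankings.items() if targets[qid] in ranked[:k])
    return hits / len(rankings)


# Placeholder data: the correct image for each query.
targets = {"q1": "img_7", "q2": "img_2"}

# Placeholder rankings from a retrieval model given the raw final request...
raw_rankings = {
    "q1": ["img_3", "img_7", "img_1"],
    "q2": ["img_9", "img_4", "img_2"],
}
# ...and given the rewritten, standalone query.
rewritten_rankings = {
    "q1": ["img_7", "img_3", "img_1"],
    "q2": ["img_4", "img_2", "img_9"],
}

for k in (1, 2):
    raw = recall_at_k(raw_rankings, targets, k)
    rewritten = recall_at_k(rewritten_rankings, targets, k)
    print(f"Recall@{k}: raw={raw:.2f} rewritten={rewritten:.2f}")
```

Scoring the same retrieval model on both query forms isolates the effect of the rewriting step from the effect of the model itself.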
- Introduces Conversational Query Rewriting (CQR) to refine vague, multi-turn user questions into clear image search prompts.
- Constructs a novel dataset of ~7,000 high-quality dialogues using LLM generation and a hybrid LLM/human review process.
- Demonstrates that this rewriting step significantly improves the accuracy of existing multimodal image retrieval models.
Why It Matters
ReCQR makes AI image search more practical by handling real, conversational queries instead of requiring perfectly phrased prompts.