CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval
A new multimodal model searches code with screenshots and diagrams, beating strong retrieval baselines by an average of 10 nDCG@10 points and improving RAG for code generation.
A research team has introduced CodeMMR, a novel AI model designed to bridge the gap between natural language, programming code, and visual artifacts such as screenshots and diagrams. Existing code search and retrieval-augmented generation (RAG) systems are largely text-centric, ignoring the rich visual context inherent in software development, such as UI mockups, data visualizations, and UML diagrams. To address this, the researchers built MMCoIR, the first comprehensive benchmark for evaluating multimodal code information retrieval, spanning five visual domains and eight programming languages. Results on this benchmark showed how challenging the task is for existing retrievers, setting the stage for their solution.
CodeMMR tackles this with instruction-based multimodal alignment, projecting text, code, and images into a single shared semantic space. This allows a developer to, for instance, search for relevant code snippets by uploading a screenshot of a desired UI component or a schematic diagram. The model generalizes well, outperforming competitive baselines such as UniIR, GME, and VLM2Vec by an average of 10 points on the nDCG@10 retrieval metric. Furthermore, integrating CodeMMR into RAG pipelines for code generation improves both the fidelity of the generated code and its visual grounding, meaning the output better matches the visual structure shown in the multimodal prompt.
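To make the retrieval mechanics concrete, here is a minimal sketch of shared-space retrieval feeding a RAG prompt. The encoder is a deterministic placeholder, not the paper's model, and the corpus, image path, and instruction string are all hypothetical stand-ins:

```python
import numpy as np

EMBED_DIM = 256

def encode(item: str, instruction: str = "") -> np.ndarray:
    """Placeholder for a unified multimodal encoder: in CodeMMR's
    setting, one model maps text, code, and images into the same
    space. Here we hash the input to a pseudo-random vector so the
    retrieval mechanics below actually run."""
    seed = hash((item, instruction)) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)

def cosine_top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 10):
    """Rank corpus rows by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# 1. Index the code corpus once: one shared-space vector per snippet.
code_corpus = [
    "def render_button(label): ...",
    "class BarChart: ...",
    "SELECT name FROM users;",
]
corpus_vecs = np.stack([encode(snippet) for snippet in code_corpus])

# 2. Query with an image plus an instruction, mirroring the
#    instruction-based alignment described above (hypothetical inputs).
query_vec = encode("ui_mockup.png",
                   instruction="Retrieve code that renders this component")
indices, scores = cosine_top_k(query_vec, corpus_vecs, k=2)

# 3. For RAG, the retrieved snippets become grounding context for the
#    code generator's prompt.
context = "\n---\n".join(code_corpus[i] for i in indices)
prompt = f"Implement the component in the screenshot.\n\nReference snippets:\n{context}"
```

The key design point is that a single index serves every query modality: because images, text, and code all land in one space, the same cosine-similarity search answers a screenshot query and a natural-language query alike.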
The work, submitted to CVPR 2026, positions multimodal retrieval as a core enabler of next-generation intelligent programming assistants. By understanding the full context of a programming task, including its visual representation, AI tools can move beyond simple text completion to become more reliable partners in software engineering. The associated datasets and model are available on Hugging Face, inviting further development by the community.
- Introduces CodeMMR, a model that jointly embeds text, code, and images for unified search, addressing a gap in vision-aware code retrieval.
- Outperforms strong baselines (UniIR, GME, VLM2Vec) by an average of 10 points on nDCG@10 across a new multimodal benchmark (MMCoIR); a worked nDCG@10 example follows this list.
- Enhances RAG for code generation, improving output fidelity and visual grounding when generating code from multimodal prompts.
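To unpack the headline metric: nDCG@10 scores a ranked list by summing log-discounted relevance gains over the top 10 results and normalizing by the best achievable ordering, so a higher score means relevant snippets sit nearer the top. A minimal sketch with made-up binary relevance labels:

```python
import numpy as np

def ndcg_at_k(relevances: list[int], k: int = 10) -> float:
    """nDCG@k for one query: the DCG of the given ranking divided by
    the DCG of the ideal (relevance-sorted) ranking. Gains at rank r
    are discounted by log2(r + 1), so hits near the top count more."""
    rel = np.asarray(relevances[:k], dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # ranks 1..k
    dcg = float(np.sum(rel / discounts))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal / discounts[:ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical judgments for one query's top-10 results:
# relevant snippets retrieved at ranks 1, 3, and 8.
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 1, 0, 0]))  # ~0.85
```

Scores are commonly reported on a 0-100 scale, so the 10-point average gap over the baselines corresponds to a 0.10 improvement in this normalized score.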
Why It Matters
Enables developers to search codebases with UI screenshots and diagrams, making AI coding assistants more context-aware and accurate.