MG²-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation
A new graph-based RAG framework achieves state-of-the-art performance while cutting costs roughly 24× and speeding up graph construction roughly 43×.
A research team led by Sijun Dai has introduced MG²-RAG (Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation), a novel framework that addresses critical limitations in current multimodal AI systems. While traditional RAG helps reduce hallucinations in Multimodal Large Language Models (MLLMs), existing approaches struggle with complex cross-modal reasoning—flat vector retrieval ignores structural dependencies, and graph-based methods rely on costly "translation-to-text" pipelines that discard fine-grained visual information. MG²-RAG tackles these problems by constructing hierarchical multimodal knowledge graphs that combine lightweight textual parsing with entity-driven visual grounding, creating unified multimodal nodes that preserve atomic evidence from both text and visual sources.
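To make the idea of a "unified multimodal node" concrete, here is a minimal, hypothetical sketch of what fusing a parsed text entity with its grounded visual regions might look like. The class and field names (`VisualRegion`, `MultimodalNode`, `fuse`) are illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: a unified multimodal node pairs a text entity
# with the image regions it is grounded to, preserving atomic evidence
# from both modalities. Names are assumptions, not the paper's API.

@dataclass
class VisualRegion:
    image_id: str
    bbox: tuple          # (x1, y1, x2, y2) in pixel coordinates
    embedding: list      # region-level feature vector

@dataclass
class MultimodalNode:
    entity: str                                   # entity parsed from text
    description: str                              # lightweight textual context
    regions: list = field(default_factory=list)   # grounded visual evidence

def fuse(entity, description, regions):
    """Build one node that keeps fine-grained evidence from text and vision."""
    return MultimodalNode(entity=entity, description=description,
                          regions=list(regions))

node = fuse(
    "Eiffel Tower",
    "wrought-iron lattice tower in Paris",
    [VisualRegion("img_001", (34, 12, 210, 480), [0.1, 0.8, 0.3])],
)
print(node.entity, len(node.regions))  # Eiffel Tower 1
```

The point of keeping region-level embeddings on the node, rather than a text caption of the image, is that retrieval can later match a query against the visual evidence directly instead of a lossy "translation-to-text" summary.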
The framework's core innovation lies in its multi-granularity graph retrieval mechanism, which aggregates dense similarities and propagates relevance across the graph to support structured multi-hop reasoning. This enables the system to perform complex inferences that connect textual concepts with specific visual regions. In extensive experiments across four representative multimodal tasks—retrieval, knowledge-based visual question answering (VQA), reasoning, and classification—MG²-RAG consistently achieved state-of-the-art performance while dramatically reducing computational overhead. Most impressively, the system delivered an average 43.3× speedup and 23.9× cost reduction compared to advanced graph-based frameworks, making sophisticated multimodal reasoning significantly more accessible and practical for real-world applications.
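The "aggregate dense similarities, then propagate relevance across the graph" step can be sketched with a simple damped score-diffusion loop, in the spirit of personalized PageRank. This is a toy stand-in under stated assumptions, not the paper's actual retrieval algorithm: each node is seeded with its dense similarity to the query, and scores then spread along edges so that multi-hop neighbors gain relevance.

```python
# Toy sketch of graph relevance propagation (assumed mechanism, not the
# paper's algorithm): seed nodes with dense query similarity, then diffuse
# scores along edges so multi-hop neighbors become retrievable.

def propagate(adjacency, seed_scores, damping=0.5, iters=10):
    """adjacency: {node: [neighbors]}; seed_scores: {node: query similarity}."""
    scores = dict(seed_scores)
    for _ in range(iters):
        nxt = {}
        for node, nbrs in adjacency.items():
            # Each neighbor shares its current score equally among its edges.
            spread = sum(scores[n] / max(len(adjacency[n]), 1) for n in nbrs)
            nxt[node] = (1 - damping) * seed_scores[node] + damping * spread
        scores = nxt
    return scores

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
seeds = {"A": 1.0, "B": 0.0, "C": 0.0}   # only A matches the query directly
final = propagate(graph, seeds)
# C ends up with nonzero relevance despite zero direct similarity,
# which is what enables multi-hop retrieval.
```

Flat vector retrieval would return only node A here; propagation surfaces B and C because they are structurally connected to the evidence the query matched.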
- MG²-RAG constructs hierarchical multimodal knowledge graphs that fuse text entities with visual regions into unified nodes, preserving fine-grained evidence
- The system achieves state-of-the-art performance across four multimodal tasks while delivering 43.3× faster graph construction and 23.9× cost reduction
- Introduces multi-granularity graph retrieval that aggregates similarities and propagates relevance to support structured multi-hop reasoning
Why It Matters
Enables more accurate and efficient multimodal AI reasoning for applications like visual search, document analysis, and complex Q&A systems at dramatically lower cost.