MM-BizRAG beats SOTA by up to 32 points on enterprise Q&A with structure-aware parsing
New multimodal RAG system routes reports and slide decks through separate ingestion pipelines, improving answer accuracy
Most multimodal RAG systems treat all documents as images, losing the rich structure of enterprise documents like reports and slide decks. MM-BizRAG, accepted at ACL 2026 Industry Track, takes a different approach: it extracts document structure up front through a structure-aware split that routes documents into orientation-specific ingestion pipelines. For vertically structured documents (e.g., reports), it applies explicit layout-aware parsing; for horizontally structured ones (e.g., slide decks), it uses holistic page-level representations. A unified LLM-driven artifact transformation pipeline with placeholder-based positional alignment preserves natural reading order, while inference-time multimodal assembly decouples retrieval representations from generation context. The result is richer, more grounded answers without any finetuning.
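The routing and placeholder ideas described above can be illustrated with a minimal sketch. It is not the authors' implementation: the majority-orientation heuristic, the `Page` type, the `[[ARTIFACT_n]]` placeholder syntax, and the pipeline names are all assumptions made for illustration, and `transform` stands in for whatever LLM call turns a raw artifact (a table, a chart) into text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Page:
    width: float
    height: float
    text: str                  # page text with placeholders marking artifact positions
    artifacts: Dict[str, str]  # placeholder -> raw artifact (hypothetical format)


def route(pages: List[Page]) -> str:
    """Pick a pipeline from the majority page orientation (assumed heuristic)."""
    landscape = sum(1 for p in pages if p.width > p.height)
    # Mostly landscape pages are treated as a slide deck, otherwise as a report.
    return "slide_pipeline" if landscape > len(pages) / 2 else "report_pipeline"


def assemble(page: Page, transform: Callable[[str], str]) -> str:
    """Substitute each placeholder with its transformed artifact.

    Because the placeholder sits where the artifact appeared in the text,
    the transformed content lands in the original reading order.
    """
    out = page.text
    for placeholder, raw in page.artifacts.items():
        out = out.replace(placeholder, transform(raw))
    return out
```

For example, a portrait page whose text reads `Intro [[ARTIFACT_0]] Outro` would be routed to the report pipeline, and `assemble` would splice the transformed table back between "Intro" and "Outro".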
Tested on a large, heterogeneous enterprise dataset and two public benchmarks (SlideVQA and FinRAGBench-V), MM-BizRAG consistently outperforms state-of-the-art vision-centric baselines by up to 32 percentage points, with especially strong gains on report-style layouts. The authors also introduce FastRAGEval, a single-call LLM judge metric for fine-grained generative recall that halves RAGChecker's cost while achieving stronger human alignment. This work demonstrates that explicit structure handling is key for enterprise-grade document Q&A, and the no-finetuning requirement makes it practical for real-world deployment.
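To make "fine-grained generative recall" from a single judge call concrete, here is a hedged sketch of the scoring step only. The summary does not specify FastRAGEval's prompt or output format, so the JSON shape (a `covered` list with one boolean per ground-truth claim) and the function name are hypothetical; the point is simply that one judge response can yield per-claim verdicts, from which recall is the covered fraction.

```python
import json


def generative_recall(judge_response: str, num_claims: int) -> float:
    """Compute recall from one judge response (hypothetical JSON format).

    Expects {"covered": [true, false, ...]} with one verdict per
    ground-truth claim, all returned by a single LLM call.
    """
    verdicts = json.loads(judge_response)["covered"]
    if len(verdicts) != num_claims:
        raise ValueError("judge must return exactly one verdict per claim")
    if num_claims == 0:
        return 0.0
    # Recall = claims the generated answer covers / all ground-truth claims.
    return sum(1 for v in verdicts if v) / num_claims
```

Under this assumed format, a response marking three of four claims as covered gives a recall of 0.75.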
- Introduces document structure-aware splitting with orientation-specific ingestion pipelines for reports vs. slide decks.
- Achieves up to a 32 percentage point improvement over SOTA vision-centric baselines on an enterprise dataset and two public benchmarks.
- Proposes FastRAGEval, a single-call LLM judge metric for generative recall that halves the cost of RAGChecker while aligning more closely with human judgments.
Why It Matters
Enables accurate, structure-aware AI Q&A on complex enterprise documents without costly finetuning.