Research & Papers

IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering

New 3,230-page benchmark shows that combining text and image search boosts recall by 3-6 percentage points over either modality alone.

Deep Dive

A research team led by Connor Shorten has released IRPAPERS, a comprehensive benchmark for evaluating visual document processing systems in scientific contexts. The benchmark contains 3,230 pages from 166 scientific papers, each available as both a page image and an OCR transcription, along with 180 carefully constructed needle-in-the-haystack questions for testing retrieval and question-answering capabilities.
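The article doesn't spell out the release format, but conceptually each item pairs one of the 180 questions with a gold page that exists in two forms, a rendered image and an OCR transcription. A rough sketch of what such a record might look like, with illustrative field names rather than the published schema:

from dataclasses import dataclass

# Hypothetical shape of one IRPAPERS item; field names are illustrative,
# not the published schema.

@dataclass
class PageRecord:
    paper_id: str     # which of the 166 papers the page comes from
    page_id: str      # unique ID among the 3,230 pages
    image_path: str   # rendered page image
    ocr_text: str     # OCR transcription of the same page

@dataclass
class NeedleQuestion:
    question: str       # one of the 180 needle-in-the-haystack questions
    gold_page_id: str   # page that contains the answer
    answer: str         # ground-truth answer used to judge QA output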

The study shows that text-based retrieval using Arctic 2.0 embeddings with BM25 achieved 46% Recall@1 and image-based retrieval reached 43%, while fusing both modalities through hybrid search lifted performance to 49% Recall@1. The strongest single result came from Cohere's Embed v4 model, which reached 58% Recall@1 using page image embeddings alone, outperforming Voyage 3 Large text embeddings and every open-source model tested. For question answering, text-based RAG systems showed higher ground-truth alignment (0.82 vs 0.71), but both modalities benefited significantly from increased retrieval depth.
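The write-up doesn't specify how the two modalities were fused, so the sketch below uses reciprocal rank fusion (RRF), a common way to merge separate text and image rankings, and then scores Recall@1 against the gold pages. The rankings, page IDs, and function names are hypothetical illustrations, not the IRPAPERS code.

# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF),
# plus a Recall@1 check. Rankings and page IDs are hypothetical; this is
# not the IRPAPERS implementation, whose fusion method isn't stated above.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of page IDs from different retrievers into one ranking.

    Each page accumulates 1 / (k + rank) from every list it appears in,
    so pages ranked highly by either modality rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, page_id in enumerate(ranking, start=1):
            scores[page_id] = scores.get(page_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_1(fused_rankings: list[list[str]], gold_pages: list[str]) -> float:
    """Fraction of questions whose gold page is ranked first."""
    hits = sum(1 for ranking, gold in zip(fused_rankings, gold_pages)
               if ranking and ranking[0] == gold)
    return hits / len(gold_pages)


if __name__ == "__main__":
    # One question's rankings from a text retriever and a page-image retriever.
    text_ranking = ["page_12", "page_07", "page_44"]
    image_ranking = ["page_12", "page_99", "page_07"]

    fused = rrf_fuse([text_ranking, image_ranking])
    print(fused)                               # ['page_12', 'page_07', 'page_99', 'page_44']
    print(recall_at_1([fused], ["page_12"]))   # 1.0, since the gold page is ranked first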

This research matters because visual documents—PDFs, scanned papers, and formatted reports—represent a massive portion of enterprise and scientific knowledge that traditional text-only systems struggle to process effectively. The benchmark demonstrates that multimodal approaches aren't just theoretical improvements but deliver measurable gains in recall accuracy. As organizations increasingly rely on RAG systems for knowledge management, IRPAPERS provides concrete evidence that investing in multimodal capabilities pays dividends, particularly for complex domains like scientific research where documents contain crucial visual information alongside text.

Key Points
  • Multimodal hybrid search combining text and image retrieval achieved 49% Recall@1, outperforming text-only (46%) and image-only (43%) approaches
  • Cohere Embed v4 page image embeddings delivered the best performance at 58% Recall@1, beating Voyage 3 Large text embeddings and all open-source models
  • Text-based RAG systems showed 0.82 ground-truth alignment vs 0.71 for image-based systems, with both benefiting from multi-document retrieval

Why It Matters

Shows that multimodal RAG systems deliver measurable accuracy gains for enterprise document search, especially for scientific and technical content.