From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents
A new study of 23,088 queries shows that a hybrid retrieval and reranking pipeline achieves a Recall@5 of 0.816.
A team of researchers has published a benchmark comparing modern retrieval methods for retrieval-augmented generation (RAG) systems that handle documents mixing text and tabular data, a common format in finance, science, and business. Their study, "From BM25 to Corrective RAG," evaluated ten strategies, including sparse retrieval (BM25), dense retrieval, hybrid fusion, cross-encoder reranking, and adaptive methods, on a challenging dataset of 23,088 queries over 7,318 financial documents. The key finding is that a two-stage pipeline delivered the best performance: hybrid retrieval first gathers candidate documents, and a neural reranker then reorders them. This pipeline reached a Recall@5 of 0.816 and an MRR@3 of 0.605, significantly outperforming every single-stage method in the comparison.
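The two-stage idea described above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the article does not say which fusion rule the authors used, so reciprocal rank fusion (RRF) is assumed here as one common choice, and `two_stage_retrieve`, `rerank_score`, `candidates`, and `top_k` are hypothetical names. In practice `rerank_score` would wrap a cross-encoder model; here it is any caller-supplied function.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in
    (rank is 1-based). k = 60 is a conventional default, assumed here.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, then by ID so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def two_stage_retrieve(
    query: str,
    sparse_ranking: Sequence[str],
    dense_ranking: Sequence[str],
    rerank_score: Callable[[str, str], float],
    candidates: int = 20,
    top_k: int = 5,
) -> List[str]:
    """Stage 1 fuses sparse and dense rankings; stage 2 reranks the candidates."""
    fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])[:candidates]
    # sorted() is stable, so equal reranker scores keep their fused order.
    reranked = sorted(fused, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_k]


# Toy usage: a stand-in "reranker" that strongly prefers document d4.
sparse = ["d1", "d2", "d3"]
dense = ["d3", "d4", "d1"]
top = two_stage_retrieve(
    "net revenue 2019", sparse, dense,
    rerank_score=lambda q, d: 1.0 if d == "d4" else 0.0,
    top_k=2,
)
```

The design point is that the expensive reranker only scores the short fused candidate list, not the whole corpus, which is where the cost-versus-accuracy trade-off mentioned in the study comes from.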
One of the most surprising results runs against a common assumption in AI: BM25, a classic statistical keyword-matching algorithm, outperformed state-of-the-art dense retrieval (semantic search) on this financial QA benchmark. This suggests that for precise numerical and factual queries in structured domains, semantic matching is not always superior to traditional keyword search. The study also found that popular query expansion techniques such as HyDE provided limited benefit for these precise queries, while adding contextual information to the retrieval index yielded more consistent gains. The authors offer actionable recommendations for balancing cost and accuracy, and they have released their full benchmark code to help practitioners build more effective RAG systems for tabular data.
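For readers unfamiliar with BM25, the sketch below shows the standard Okapi BM25 scoring formula in plain Python. It is illustrative only: the tokenized toy documents are made up, the `k1 = 1.5` and `b = 0.75` values are common defaults rather than the study's settings, and the function name `bm25_scores` is hypothetical. It assumes a non-empty corpus of non-empty, pre-tokenized documents.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization penalizes documents longer than average.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Made-up toy corpus: only the first document contains the exact token "2019".
docs = [
    ["net", "revenue", "2019", "was", "1200"],
    ["the", "company", "expanded", "its", "operations"],
    ["revenue", "grew", "in", "2018"],
]
scores = bm25_scores(["revenue", "2019"], docs)
```

In this toy case the first document scores highest because it matches the rare token "2019" exactly, which hints at why exact-match scoring can do well on queries that hinge on specific numbers or years.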
- A hybrid retrieval + neural reranking pipeline achieved top scores (Recall@5: 0.816, MRR@3: 0.605) on a 23,088-query financial benchmark.
- The BM25 algorithm outperformed modern dense retrieval methods, challenging the assumption that semantic search is always better.
- Query expansion methods like HyDE showed limited value for precise numerical queries, while contextual retrieval improvements were more reliable.
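The Recall@5 and MRR@3 figures quoted above can be computed per query and then averaged. The sketch below uses the standard textbook definitions; the article does not spell out the study's exact formulas, so treat these as an assumption, and the function names are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Sequence[float]) -> float:
    """Average per-query scores to get the benchmark-level number."""
    return sum(values) / len(values) if values else 0.0
```

Under these definitions, a query whose only relevant document is ranked second contributes 1.0 to Recall@5 and 0.5 to MRR@3, which is why the two numbers can differ substantially for the same pipeline.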
Why It Matters
The benchmark gives developers data-driven architecture guidance for building accurate RAG systems in finance, reporting, and any other domain that mixes text with tables.