StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
New open-source dataset shows hybrid retrieval outperforming both BM25 and dense-only methods.
A new open-source dataset called StratRAG, created by researcher Aryan Patodiya and published on arXiv, is set to advance how Retrieval-Augmented Generation (RAG) systems handle complex, multi-hop reasoning tasks. Derived from the HotpotQA dataset's distractor setting, StratRAG includes 2,200 carefully curated examples spanning three question types: bridge (requiring linking multiple facts), comparison (contrasting entities), and yes-no. Each example comes with a pool of 15 candidate documents, containing exactly 2 gold documents and 13 topically related distractors, simulating realistic, noisy conditions where irrelevant information competes for retrieval attention.
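The paper does not show the dataset's exact record schema, but the structure described above (one question, 15 candidates, 2 gold, 13 distractors) can be sketched as follows; every field name and string here is a hypothetical illustration, not taken from StratRAG itself:

```python
# Hypothetical StratRAG-style record; field names and contents are
# assumptions for illustration, not the dataset's actual schema.
example = {
    "question": "Which film was released earlier, X or Y?",
    "type": "comparison",                            # bridge / comparison / yes-no
    "candidates": [f"doc_{i}" for i in range(15)],   # 15 candidate documents
    "gold_ids": ["doc_0", "doc_1"],                  # exactly 2 gold documents
}

# The remaining 13 candidates act as topically related distractors.
distractors = [d for d in example["candidates"] if d not in example["gold_ids"]]
```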
The paper benchmarks three retrieval strategies—BM25 (a classic keyword-based method), dense retrieval using the all-MiniLM-L6-v2 sentence transformer, and hybrid fusion that combines both approaches. Results show hybrid retrieval as the overall winner, achieving a Recall@2 of 0.70 and Mean Reciprocal Rank (MRR) of 0.93 on the validation set. However, bridge questions proved substantially harder, with a lower Recall@2 of 0.67, highlighting a key weakness in current RAG systems for multi-hop reasoning. The dataset is publicly available, and the author suggests future work on reinforcement-learning-based retrieval policies to address these gaps.
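The summary above does not specify how the hybrid strategy fuses the two score lists. One standard technique for combining a keyword ranking with a dense ranking is reciprocal rank fusion (RRF); the sketch below uses hypothetical document IDs and should be read as one plausible fusion method, not the paper's exact implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs via reciprocal rank fusion.

    Each document's fused score is sum(1 / (k + rank)) over every
    list in which it appears; a higher fused score ranks earlier.
    The constant k (60 is a common default) damps the influence of
    top ranks from any single retriever.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings over a candidate pool: BM25 favors d3,
# the dense retriever favors d7; RRF rewards d7's agreement.
bm25_ranking = ["d3", "d7", "d1", "d9"]
dense_ranking = ["d7", "d2", "d3", "d5"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents ranked highly by both retrievers (here d7 and d3) float to the top of the fused list, which is why fusion tends to help when keyword and semantic signals disagree.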
- StratRAG includes 2,200 examples from HotpotQA with three question types: bridge, comparison, and yes-no.
- Each example has 15 candidate documents (2 gold, 13 distractors) to simulate noisy retrieval conditions.
- Hybrid retrieval achieves best performance (Recall@2=0.70, MRR=0.93), but bridge questions lag at Recall@2=0.67.
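The two reported metrics are simple to compute from a ranked candidate list. A minimal sketch, with hypothetical gold IDs and rankings (the dataset guarantees exactly 2 gold documents per example, so Recall@2 checks how many of them land in the top two):

```python
def recall_at_k(ranking, gold, k=2):
    """Fraction of gold documents appearing in the top-k of the ranking."""
    return len(set(ranking[:k]) & set(gold)) / len(gold)

def reciprocal_rank(ranking, gold):
    """1 / rank of the first gold document retrieved; 0.0 if none appear."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

# Hypothetical example: 2 gold docs, one retrieved at rank 1,
# the other pushed down to rank 3 by a distractor.
gold = {"d3", "d7"}
ranking = ["d7", "d1", "d3", "d9"]
r_at_2 = recall_at_k(ranking, gold, k=2)   # only d7 in the top-2
mrr = reciprocal_rank(ranking, gold)       # first gold hit at rank 1
```

MRR is the mean of `reciprocal_rank` over all examples, which is why it can sit near 0.93 while Recall@2 stays at 0.70: the first gold document is usually ranked high, but the second often falls outside the top two.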
Why It Matters
StratRAG provides a realistic benchmark to improve RAG systems' multi-hop reasoning, critical for enterprise AI applications.