Research & Papers

Reproducing Complex Set-Compositional Information Retrieval

Dense retrievers collapse from 42% recall to under 2% on complex set queries.

Deep Dive

A team of researchers (Degenhart, Timman, de Vries, Hasibi, Hoveyda) conducted a rigorous reproducibility study of complex set-compositional information retrieval — queries that combine multiple conditions using conjunction, disjunction, and exclusion. They benchmarked the major retrieval families (lexical, dense neural, sparse, reasoning-targeted) on two testbeds: the existing QUEST benchmark with its QUEST+Variants extension, and their own controlled benchmark, LIMIT+. The study, accepted at SIGIR 2026, tests whether current systems genuinely satisfy logical constraints or merely exploit semantic shortcuts learned during pretraining.
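For intuition, the set-compositional queries described above can be modeled as set algebra over per-condition answer sets. The sketch below is a hypothetical illustration of the query semantics, not the authors' code; the toy documents and the `matching` helper are invented for the example:

```python
# Hypothetical sketch: a set-compositional query ("A and B, but not C")
# evaluated as exact set algebra over per-condition answer sets.
# Toy corpus: each document is tagged with atomic attributes.
docs = {
    "d1": {"novel", "19th-century", "french"},
    "d2": {"novel", "19th-century"},
    "d3": {"novel", "french"},
    "d4": {"poem", "19th-century"},
}

def matching(attribute):
    """Answer set for one atomic condition: all docs carrying the attribute."""
    return {doc_id for doc_id, attrs in docs.items() if attribute in attrs}

# "19th-century novels that are not French":
# conjunction -> intersection, exclusion -> set difference.
result = (matching("novel") & matching("19th-century")) - matching("french")
# -> {"d2"}
```

An exact evaluator like this trivially satisfies the logical constraints; the paper's question is whether learned retrievers, which score documents by semantic similarity rather than by explicit set operations, do the same.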

The results are striking. On QUEST, the best neural retrievers more than double the effectiveness of BM25 (Recall@100 >0.41 vs. 0.20), but reasoning-focused methods like ReasonIR and Search-R1 don't consistently outperform general-purpose dense models. However, on LIMIT+, where relevance depends on arbitrary attribute predicates rather than pretrained knowledge, the strongest QUEST method collapses from ~0.42 recall to below 0.02 — while classic lexical BM25 jumps to ~0.96. Stratifying by compositional depth reveals a consistent pattern: algebraic sparse and lexical methods maintain stable performance, while dense approaches degrade sharply. The authors release code and LIMIT+ generation scripts to facilitate future controlled evaluation.

Key Points
  • On QUEST, best neural retrievers achieve Recall@100 >0.41, more than double BM25's 0.20
  • On LIMIT+ benchmark, strongest QUEST method collapses to <0.02 recall while BM25 reaches ~0.96
  • Dense methods degrade consistently with compositional depth; lexical and sparse methods remain stable
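The numbers above are Recall@k scores. Assuming the standard definition (this sketch is not tied to the authors' evaluation scripts), the metric can be computed as:

```python
def recall_at_k(retrieved, relevant, k=100):
    """Fraction of the relevant documents that appear in the top-k
    positions of the ranked retrieval list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy example: 2 of 3 relevant docs appear in the top-4 ranking.
score = recall_at_k(["d7", "d2", "d9", "d4"], ["d2", "d4", "d5"], k=4)
# -> 0.666...
```

Under this metric, a Recall@100 of 0.96 means BM25 surfaces nearly every relevant document within its top 100 results, while 0.02 means a dense retriever surfaces almost none.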

Why It Matters

Challenges the assumption that neural retrievers genuinely handle Boolean set logic, a capability critical for enterprise search and database systems.