SR4CS: A Collection of Systematic Reviews in Computer Science
New open-source corpus with 104,316 references exposes limitations of naive LLM-generated Boolean queries.
A team of researchers led by Pierre Achkar and Martin Potthast has released SR4CS, a major new dataset designed to advance AI automation in academic literature review. The collection comprises 1,212 systematic reviews from computer science, complete with 104,316 resolved references and the original expert-designed Boolean search queries used to find them. This addresses a critical gap: while systematic reviews are the gold standard for synthesizing evidence, creating them requires immense manual effort, and evaluation resources have been largely confined to biomedical domains, limiting reproducible AI research elsewhere.
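For orientation, each SR4CS entry couples a review with its expert query and the resolved references that query was meant to find. The sketch below renders that structure in Python; the field names are illustrative assumptions, not the dataset's actual schema, which is documented in the Zenodo release.

```python
# Hypothetical shape of a single SR4CS entry (field names assumed for
# illustration; consult the Zenodo documentation for the real schema).
review = {
    "review_id": "sr-0001",                  # one of 1,212 reviews
    "title": "A Systematic Review of ...",
    "expert_query": '("neural network*" OR "deep learning") AND testing',
    "references": [                          # resolved references
        {"ref_id": "r-0001", "title": "...", "doi": "10.1145/..."},
        # ... the corpus resolves 104,316 references in total
    ],
}
```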
SR4CS is structured to support controlled experimentation on three key automation tasks: Boolean query generation, document retrieval, and paper screening. To enable fair comparisons, the researchers also provide normalized versions of the expert queries. In baseline experiments, they compared retrieval with these expert queries against zero-shot LLM-generated Boolean queries, BM25 keyword search, and modern dense retrieval methods. The results reveal systematic differences in precision, recall, and ranking behavior, exposing clear limitations of naive LLM-generated Boolean queries relative to human expert strategies.
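A minimal sketch of the kind of evaluation this setup supports, assuming each review's resolved references serve as the relevance labels; the function and identifiers below are illustrative, not taken from the SR4CS code:

```python
def precision_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> tuple[float, float]:
    """Set-based precision/recall of a query's retrieved documents,
    scored against a review's resolved references as relevance labels."""
    hits = retrieved_ids & relevant_ids
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Illustrative comparison: an expert Boolean query often trades some
# precision for high recall, while a naive LLM-generated query may do
# the opposite (IDs are placeholders, not real dataset records).
relevant = {"r-0001", "r-0002", "r-0003", "r-0004"}
expert_hits = {"r-0001", "r-0002", "r-0003", "r-0004", "r-0099", "r-0100"}
llm_hits = {"r-0001", "r-0002"}

print(precision_recall(expert_hits, relevant))  # ≈(0.67, 1.0): broad but complete
print(precision_recall(llm_hits, relevant))     # (1.0, 0.5): precise but misses half
```

For ranked methods such as BM25 or dense retrieval, the same relevance labels support cutoff metrics like recall@k, which is what makes ranking behavior comparable across retrieval paradigms.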
The dataset, released under an open license on Zenodo with full documentation and code, provides a standardized benchmark for the research community. It moves the field beyond anecdotal testing and allows for reproducible evaluation of how different AI paradigms perform on the complex, multi-step task of literature review automation. This is a foundational step toward developing more reliable AI assistants that can help researchers scale evidence synthesis across scientific disciplines.
- Contains 1,212 systematic reviews with 104,316 references and original expert Boolean queries.
- Baseline tests expose precision/recall trade-offs among expert queries, LLM-generated queries, and dense retrieval.
- Released as an open-source benchmark to enable reproducible AI research on automating literature reviews.
Why It Matters
Provides a standardized testbed to build and evaluate AI tools that can drastically reduce the time required for comprehensive literature reviews.