Research & Papers

OpenExtract: Automated Data Extraction for Systematic Reviews in Health

Open-source pipeline uses LLMs to extract data from scientific papers, achieving human-level accuracy.

Deep Dive

A consortium of researchers from institutions including the University of Warwick and TU Delft has released OpenExtract, a novel open-source pipeline designed to automate one of the most labor-intensive phases of academic research: data extraction for systematic literature reviews. The system works by querying large language models (LLMs), prompting them to locate and predict specific data entries (such as study outcomes, sample sizes, or intervention details) from designated sections within scientific PDFs. This approach moves beyond simple text scraping: rather than matching keywords, the model extracts structured information with awareness of the surrounding context.
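
To make the approach concrete, the sketch below shows what one extraction step of this kind might look like in Python. It is a minimal illustration under stated assumptions, not OpenExtract's actual code: the field schema, the prompt wording, the model name, and the use of the openai and pypdf packages are all choices made for this example.

    # Illustrative sketch of an LLM-based extraction step (not OpenExtract's API).
    # Assumes the openai and pypdf packages and an OPENAI_API_KEY in the environment.
    import json
    from openai import OpenAI
    from pypdf import PdfReader

    client = OpenAI()

    # Hypothetical schema: the entries a reviewer would normally copy out by hand.
    FIELDS = ["sample_size", "intervention", "primary_outcome"]

    def extract_entries(pdf_path: str, pages: range) -> dict:
        """Pull text from a designated section of a PDF and ask an LLM to
        return the target data entries as JSON (null when a field is absent)."""
        reader = PdfReader(pdf_path)
        section_text = "\n".join(reader.pages[i].extract_text() for i in pages)
        prompt = (
            "From the following excerpt of a scientific article, extract these "
            f"fields as a JSON object: {', '.join(FIELDS)}. "
            "Use null for any field that is not reported.\n\n" + section_text
        )
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model choice, not the paper's
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},  # constrain output to parseable JSON
        )
        return json.loads(response.choices[0].message.content)

    # Usage: extract from, say, the methods/results pages of one included study.
    # print(extract_entries("study.pdf", range(3, 6)))

Prompting for a fixed JSON schema rather than free-form answers is what makes the output machine-checkable against a human-extracted gold standard.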

In a rigorous validation test on a systematic review in the digital health domain, OpenExtract's performance was benchmarked directly against human researchers. The pipeline achieved both precision and recall scores exceeding 0.8, indicating that it both finds the relevant data points (recall) and extracts them correctly (precision). The open-source nature of the project, detailed in an arXiv preprint (arXiv:2603.13338 [cs.IR]), invites further development and application across other scientific fields burdened by manual evidence synthesis.
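
Those figures can be read in the standard information-retrieval sense. Below is a minimal sketch of that scoring, under the assumption (made here for illustration, not stated in the source) that each paper's extractions are compared against the human gold standard as sets of (field, value) pairs.

    # Minimal scoring sketch: compare extracted entries to a human gold standard.
    def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0  # correct / extracted
        recall = true_positives / len(gold) if gold else 0.0               # correct / expected
        return precision, recall

    # Hypothetical example for one paper: two of three entries match exactly.
    pred = {("sample_size", "120"), ("intervention", "app-based CBT"), ("primary_outcome", "PHQ-9")}
    gold = {("sample_size", "120"), ("intervention", "app-based CBT"), ("primary_outcome", "PHQ-9 score")}
    print(precision_recall(pred, gold))  # (0.667, 0.667), rounded

In practice, matching extracted values against annotations usually needs some normalization (units, rounding, synonyms); exact set intersection is the simplest possible criterion.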

The development addresses a critical bottleneck in evidence-based medicine and policy. Systematic reviews, which form the foundation of clinical guidelines, often require teams to manually screen thousands of papers, a process that can take months or years. By automating data extraction with LLMs, OpenExtract promises to drastically accelerate the pace of research synthesis, reduce human error, and free up expert time for higher-level analysis and interpretation.

Key Points
  • OpenExtract is an open-source pipeline that automates data extraction for systematic literature reviews using LLMs.
  • In testing on a digital health review, it achieved precision and recall scores >0.8, matching human researcher performance.
  • The tool specifically queries LLMs to predict structured data entries from relevant sections of scientific article PDFs.

Why It Matters

This could slash the months-long timeline of systematic reviews, accelerating evidence-based medical research and clinical guideline development.