AI Safety

PaperRepro: Automated Computational Reproducibility Assessment for Social Science Papers

A new multi-agent system automates the costly process of verifying whether social science research can be reproduced.

Deep Dive

A research team led by Linhao Zhang has introduced PaperRepro, a novel AI system designed to automate the assessment of computational reproducibility in social science papers. The tool addresses a critical bottleneck in scientific research: manually verifying that published results can be reproduced from the authors' code and data is notoriously time-consuming and expensive. PaperRepro tackles this by employing a two-stage, multi-agent approach that separates the execution of the reproduction package from the evaluation of its success. This architecture is a direct response to the limitations of previous agent-based methods, which often struggled with context capacity, inadequate tooling, and insufficient capture of results.

The system's first stage uses specialized agents to execute the provided code, edit it as needed, and capture the reproduced results as explicit artifacts. The second stage then uses different agents to evaluate reproducibility based solely on this captured evidence. By assigning distinct responsibilities and equipping agents with expert prompts and task-specific tools, PaperRepro mitigates the limitations of earlier agent-based approaches. The team reports that on their REPRO-Bench benchmark, PaperRepro achieved the best overall performance, showing a 21.9% relative improvement in score-agreement accuracy over the strongest prior baseline. The researchers also refined their benchmark into REPRO-Bench-S, stratified by execution difficulty for more diagnostic evaluation. This advancement represents a significant step toward scalable, automated verification of scientific claims, which is foundational for research credibility.
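The two-stage split described above can be sketched in miniature. The sketch below is illustrative only: all names (`Artifact`, `ReproPackage`, `execution_stage`, `evaluation_stage`) are hypothetical stand-ins, not the authors' API, and the "execution" is mocked rather than actually running a reproduction package. The key idea it demonstrates is the architectural separation: stage one produces explicit artifacts, and stage two judges reproducibility from those artifacts alone, never from the execution context.

```python
from dataclasses import dataclass

@dataclass
class Artifact:
    """A captured output from running the reproduction package (e.g. a log or table)."""
    name: str
    content: str

@dataclass
class ReproPackage:
    """Stand-in for a paper's reproduction package: claims plus mocked code outputs."""
    paper_claims: list[str]
    code_outputs: dict[str, str]  # placeholder for actually executing the authors' code

def execution_stage(pkg: ReproPackage) -> list[Artifact]:
    """Stage 1: 'run' the package and persist results as explicit artifacts."""
    return [Artifact(name=k, content=v) for k, v in pkg.code_outputs.items()]

def evaluation_stage(claims: list[str], artifacts: list[Artifact]) -> float:
    """Stage 2: score reproducibility from the captured evidence only."""
    evidence = " ".join(a.content for a in artifacts)
    supported = sum(1 for c in claims if c in evidence)
    return supported / len(claims) if claims else 0.0

pkg = ReproPackage(
    paper_claims=["effect size = 0.42", "n = 1200"],
    code_outputs={"run.log": "fit complete: effect size = 0.42, n = 1200"},
)
score = evaluation_stage(pkg.paper_claims, execution_stage(pkg))
print(score)  # both claims appear in the captured artifacts -> 1.0
```

In the real system each stage is handled by multiple specialized agents with their own prompts and tools; the point of the sketch is only that the evaluator consumes persisted artifacts rather than sharing the executor's context, which is how the design sidesteps context-capacity and result-capture failures.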

Key Points
  • Uses a novel two-stage, multi-agent architecture separating execution from evaluation to overcome context and tooling limits.
  • Achieved a 21.9% relative improvement in score-agreement accuracy on the REPRO-Bench benchmark over prior methods.
  • Introduced REPRO-Bench-S, a stratified benchmark for more diagnostic evaluation of automated reproducibility systems.

Why It Matters

Automates a critical, costly validation step, potentially increasing trust and efficiency in the scientific publication process.