Research & Papers

PreScience: A Benchmark for Forecasting Scientific Contributions

A new benchmark reveals that frontier models like GPT-5 score only 5.6/10 at predicting real scientific contributions.

Deep Dive

A multi-institutional team led by researchers from the Allen Institute for AI (AI2), University of Washington, and Northwestern University has launched PreScience, a groundbreaking benchmark designed to answer a critical question: Can AI systems trained on the scientific record forecast the advances that follow? The benchmark decomposes the research process into four interdependent generative tasks—collaborator prediction, prior work selection, contribution generation, and impact prediction—using a meticulously curated dataset of 98,000 recent AI-related papers. This dataset features disambiguated author identities and a structured graph of 502,000 total publications, providing a robust testbed for evaluating AI's capacity to anticipate scientific evolution. The goal is to create systems that could help researchers identify impactful directions and collaborators, fundamentally accelerating the pace of discovery.
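
To make the four-task decomposition concrete, here is a minimal sketch of how a single forecasting instance might be represented. The class and field names are illustrative assumptions, not PreScience's actual schema; the key idea is that each task's target comes from a held-out future paper that the model must anticipate from the record before a cutoff date.

```python
from dataclasses import dataclass

# Hypothetical sketch of one forecasting instance; field names are
# illustrative assumptions, not the benchmark's actual schema.
@dataclass
class ForecastInstance:
    seed_author_ids: list[str]     # disambiguated author identities
    cutoff_date: str               # model may only see papers published before this date
    # Targets for the four tasks, drawn from the held-out future paper:
    true_collaborators: list[str]  # collaborator prediction
    true_references: list[str]     # prior work selection (IDs in the publication graph)
    true_contribution: str         # contribution generation (free text)
    true_citation_count: int       # impact prediction
```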

The team established baselines and introduced LACERScore, a novel LLM-based metric for measuring contribution similarity that approximates human judgment more closely than prior automated metrics. Initial results reveal substantial headroom for improvement across all tasks. In contribution generation, for instance, frontier models like GPT-5 achieved an average similarity score of just 5.6 on a 1-10 scale when compared to the actual published work. When the four tasks were composed into a 12-month end-to-end simulation of scientific production, the resulting synthetic corpus was systematically less diverse and less novel than human-authored research from the same period. This gap highlights the current limitations of AI in replicating the creative, serendipitous nature of human science and sets a clear, measurable challenge for the next generation of scientific AI agents.
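
The paper's own implementation of LACERScore isn't detailed here, but LLM-based similarity metrics of this kind typically work by prompting a judge model to compare two texts and return a scalar. A minimal sketch of that general pattern follows; the model name, prompt wording, and 1-10 scale handling are assumptions for illustration, not LACERScore's actual design.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_similarity(predicted: str, actual: str, model: str = "gpt-4o") -> int:
    """Score how closely a predicted contribution matches the published one (1-10).

    A generic LLM-as-judge pattern; not LACERScore's actual implementation.
    """
    prompt = (
        "Rate from 1 (unrelated) to 10 (essentially the same contribution) how "
        "similar these two scientific contributions are. Reply with only the number.\n\n"
        f"Predicted contribution:\n{predicted}\n\n"
        f"Published contribution:\n{actual}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    # The prompt constrains the reply to a bare number, so parse it directly.
    return int(response.choices[0].message.content.strip())
```

In practice such judge-based metrics are validated by checking their correlation with human ratings, which is how a metric like this can be said to better approximate human judgment than older n-gram or embedding similarity measures.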

Key Points
  • PreScience benchmark tests four core forecasting tasks using a dataset of 98K AI papers and a graph of 502K total publications.
  • Introduced LACERScore, a new LLM-based evaluation metric that better approximates human judgment for contribution similarity.
  • Even top models like GPT-5 score only 5.6/10 on contribution generation, and simulated research is less novel than human work.

Why It Matters

This benchmark could lead to AI tools that help scientists identify breakthrough research directions and collaborators, accelerating discovery.