Paper2Data: Large-Scale LLM Extraction and Metadata Structuring of Global Urban Data from Scientific Literature
Researchers' new LLM system achieves about 90% recall in mining hidden urban data from academic literature.
A research team from multiple institutions has developed Paper2Data, a novel large language model (LLM) pipeline designed to solve a major problem in urban research: the lack of a unified platform for discovering global urban datasets. Currently, researchers must manually sift through websites and dense scientific literature. Paper2Data automates this by scanning over 15,000 publications from Nature-affiliated journals, identifying mentions of datasets, and structuring them using a unified metadata schema. The result is UrbanDataMiner, an open portal that supports search and filtering across more than 60,000 extracted urban datasets.
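To make the idea of a unified metadata schema with search and filtering concrete, here is a minimal Python sketch. It is not Paper2Data's actual schema or UrbanDataMiner's API: the `DatasetRecord` fields, the `search` function, and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One extracted dataset mention. All field names are hypothetical."""
    name: str
    source_paper: str
    topic: str
    region: Optional[str] = None
    url: Optional[str] = None


def search(records: List[DatasetRecord],
           keyword: Optional[str] = None,
           region: Optional[str] = None) -> List[DatasetRecord]:
    """Return records matching a keyword (in name or topic) and/or a region."""
    results = []
    for rec in records:
        if keyword is not None:
            kw = keyword.lower()
            if kw not in rec.name.lower() and kw not in rec.topic.lower():
                continue
        if region is not None:
            if (rec.region or "").lower() != region.lower():
                continue
        results.append(rec)
    return results


# Made-up sample records, not taken from the real portal.
records = [
    DatasetRecord("City Air Quality Readings", "Paper A", "air pollution", "Europe"),
    DatasetRecord("Street Tree Inventory", "Paper B", "urban greenery", "Asia"),
    DatasetRecord("Traffic Sensor Counts", "Paper C", "mobility", None),
]

air_hits = search(records, keyword="air")
asia_hits = search(records, region="asia")
```

Because every record shares the same fields, one filter function can serve all datasets regardless of which paper they came from, which is the practical benefit of structuring extracted mentions under a single schema.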
In evaluations against human annotations, the system reaches approximately 90% recall in identifying datasets and field-level precision above 80%. Crucially, UrbanDataMiner can retrieve over 9% of datasets that are not easily discoverable through general-purpose search engines like Google, revealing a significant amount of 'hidden' data locked within academic papers. The team has made its code and data publicly available, providing the first large-scale, literature-derived infrastructure for urban data discovery. This enables more systematic, efficient, and reusable data-driven research across urban planning, environmental science, public health, and other disciplines that rely on heterogeneous urban data.
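The two reported metrics measure different things: recall asks how many of the human-annotated datasets the pipeline found, while field-level precision asks how many of the metadata fields it filled in are correct. The sketch below shows one common way to compute each. The function names, the exact-match comparison, and the toy values are assumptions for illustration; the article does not describe the team's evaluation procedure in this detail.

```python
def recall(gold, predicted):
    """Share of human-annotated (gold) datasets that the system also found."""
    gold, predicted = set(gold), set(predicted)
    if not gold:
        return 0.0
    return len(gold & predicted) / len(gold)


def field_precision(extracted, reference):
    """Share of filled-in metadata fields whose value matches the reference.

    Exact string match is an assumed criterion; a real evaluation may be
    more lenient.
    """
    filled = {k: v for k, v in extracted.items() if v is not None}
    if not filled:
        return 0.0
    correct = sum(1 for k, v in filled.items() if reference.get(k) == v)
    return correct / len(filled)


# Toy data: 10 annotated datasets, of which the system finds 9.
gold_ids = [f"d{i}" for i in range(10)]
found_ids = [f"d{i}" for i in range(9)] + ["spurious"]
toy_recall = recall(gold_ids, found_ids)

# Toy record: 5 filled fields, 4 of which match the reference.
extracted = {"name": "Tree Inventory", "topic": "greenery", "region": "Asia",
             "year": "2020", "format": "CSV", "url": None}
reference = {"name": "Tree Inventory", "topic": "greenery", "region": "Asia",
             "year": "2021", "format": "CSV", "url": "unknown"}
toy_precision = field_precision(extracted, reference)
```

In this toy setup the spurious extra dataset does not lower recall, and the empty `url` field is excluded from precision because the system made no claim about it.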
- Paper2Data is an LLM pipeline that extracted 60,000+ urban datasets from 15,000+ scientific papers.
- It powers the UrbanDataMiner portal, achieving ~90% recall in dataset identification and >80% field-level precision.
- The system retrieves over 9% of datasets that are not easily found via Google, creating a first-of-its-kind discovery infrastructure.
Why It Matters
Paper2Data automates the tedious hunt for urban data, unlocking hidden datasets to accelerate research in sustainability, planning, and public policy.