Research & Papers

The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment

Open-source infrastructure integrates 960GB of research data using BGE embeddings for semantic mapping.

Deep Dive

Researcher Jonas Wilinski has introduced the Science Data Lake, an open-source infrastructure that tackles the persistent fragmentation of scholarly data. The system integrates approximately 293 million uniquely identifiable papers from eight major open sources—including Semantic Scholar, OpenAlex, SciSciNet, and Papers with Code—into a unified, locally deployable resource built on DuckDB and Parquet files. By normalizing DOIs while preserving source-level schemas, it creates a 960GB dataset spanning 22 schemas and 153 SQL views, making previously siloed research accessible through a single interface. The project addresses a critical bottleneck in meta-research and literature analysis, where cross-database studies have been notoriously difficult.

The technical breakthrough lies in its embedding-based ontology alignment, which uses BGE-large sentence embeddings to map 4,516 OpenAlex topics to 1.3 million terms across 13 scientific ontologies. This AI-powered mapping achieves an F1 score of 0.77 at the recommended threshold, outperforming traditional methods like TF-IDF and BM25. The resource has been validated through 10 automated checks and cross-source citation analysis, demonstrating high agreement (Pearson r = 0.76–0.87). Available for remote querying via HuggingFace or local deployment, the Science Data Lake includes structured documentation suitable for LLM-based research agents, enabling new forms of large-scale scientific discovery and analysis that were previously infeasible.
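The alignment step described above boils down to embedding topic labels and ontology terms in the same vector space, then keeping matches whose cosine similarity clears a threshold. Here is a toy sketch of that matching logic using hand-made stand-in vectors; the real system uses BGE-large sentence embeddings, and the threshold value here is purely illustrative, not the paper's recommended one.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align(topic_vec, ontology_terms, threshold=0.8):
    """Return ontology terms whose embedding similarity to the topic
    clears the threshold, best match first (threshold is illustrative)."""
    hits = [(name, cosine(topic_vec, vec)) for name, vec in ontology_terms.items()]
    return sorted([(n, s) for n, s in hits if s >= threshold], key=lambda x: -x[1])

# Stand-in 3-d "embeddings" (real BGE-large vectors are 1024-d):
ontology = {
    "machine learning": [0.9, 0.1, 0.0],
    "agriculture":      [0.0, 0.2, 0.9],
}
topic = [0.85, 0.15, 0.05]
matches = align(topic, ontology)
print(matches[0][0])  # machine learning
```

In the actual system this comparison runs over 4,516 topic vectors against 1.3 million term vectors, so a vectorized or approximate-nearest-neighbor search would replace the pairwise loop.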

Key Points
  • Integrates 293 million papers from 8 sources (Semantic Scholar, OpenAlex, etc.) into 960GB of unified Parquet files
  • Uses BGE-large embeddings to map 4,516 topics across 13 ontologies with 77% F1 score, beating TF-IDF/BM25 baselines
  • Enables cross-source analyses via local DuckDB deployment or remote HuggingFace querying with LLM-agent-ready documentation

Why It Matters

Enables previously impossible large-scale meta-research and breaks down silos between academic databases for AI-powered discovery.