SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
New dataset built on 135,875 real database schemas tackles the biggest bottleneck in AI-powered data querying.
A team of researchers has released SQaLe, a dataset aimed at a persistent hurdle in AI development: teaching models to reliably convert natural-language questions into SQL queries. The dataset, accepted at the AI for Tabular Data workshop at NeurIPS 2025, is built on 135,875 real-world relational database schemas sourced from SchemaPile. Using a generation pipeline that combines schema sampling, question synthesis, and SQL construction, the team produced 517,676 high-quality (question, schema, query) triples. The researchers argue that this scale and grounding in real schemas address a known bottleneck: a shortage of training data with realistic schema complexity, domain coverage, and task diversity, which has limited progress on generalizable text-to-SQL models.
SQaLe is designed to capture the variability of real-world data environments, including realistic schema sizes, diverse query patterns, and the natural ambiguity of human language, while ensuring that all generated SQL is executable. The researchers' analysis positions SQaLe as the most realistic large-scale text-to-SQL dataset available, ahead of existing benchmarks on that measure. The resource is meant to support data scaling in the field: researchers and engineers can train or fine-tune large language models (the class of systems represented by GPT-4, Claude, or Llama) on a corpus that mirrors the complexity of actual enterprise databases. The authors present this as a step toward AI agents that can autonomously and accurately interact with organizational data, moving beyond simple demo queries to the messy, complex schemas found in production environments.
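To make the "executable SQL" requirement concrete, here is a minimal sketch of how a single (question, schema, query) triple could be checked. It is an illustration under stated assumptions, not the SQaLe authors' validation method: the `Triple` class, its field names, the `is_executable` helper, the sample schema, and the choice of SQLite as the engine are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Triple:
    """One (question, schema, query) example. Field names are hypothetical."""
    question: str
    schema: str  # CREATE TABLE statements
    query: str   # SQL intended to answer the question


def is_executable(triple: Triple) -> bool:
    """Return True if the query runs against empty tables built from the schema."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.executescript(triple.schema)      # create the (empty) tables
        conn.execute(triple.query).fetchall()  # run the query; rows are irrelevant
        return True
    except sqlite3.Error:
        # Covers syntax errors, unknown tables or columns, and similar failures.
        return False
    finally:
        conn.close()


# Illustrative data only; not taken from SQaLe or SchemaPile.
good = Triple(
    question="How many orders did each customer place?",
    schema="""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id)
        );
    """,
    query="""
        SELECT c.name, COUNT(o.id) AS n_orders
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
    """,
)
# Same schema, but the query references a column that does not exist.
bad = Triple(good.question, good.schema, "SELECT total FROM orders")

print(is_executable(good))  # True
print(is_executable(bad))   # False
```

A check like this only confirms that a query parses and resolves against the schema. It says nothing about whether the query actually answers the question, which is a separate quality concern.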
- Built on 135,875 real database schemas from SchemaPile, ensuring realistic complexity.
- Contains 517,676 high-quality text-to-SQL examples (question, schema, query triples).
- Designed to solve the data bottleneck for training generalizable AI models on complex queries.
Why It Matters
SQaLe supports the development of AI that can reliably query complex business data, moving text-to-SQL from demos toward real-world deployment.