DivSkill-SQL boosts text-to-SQL accuracy by up to 11 points with residual skill optimization
New framework builds complementary agentic ensembles without fine-tuning, cutting hallucinated schema references and function calls by up to 3x.
Text-to-SQL systems often generate multiple candidate queries and select the best one, but their accuracy is capped by Pass@K, the probability that at least one of the K candidates is correct: if every candidate is wrong, no selector can recover the right query. Existing approaches get their diversity from stochastic decoding or prompt variations, which tend to produce correlated failures, so extra candidates miss the same examples. A new paper by Jiongli Zhu and colleagues introduces DivSkill-SQL, a framework that addresses this by optimizing residual skills on the examples the current ensemble fails on. Each new skill is optimized for its marginal contribution to Pass@K, that is, for the queries no existing ensemble member gets right, and none of this requires model retraining. The approach builds complementary agentic ensembles that provably improve coverage of correct SQL queries.
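The idea of targeting marginal Pass@K can be illustrated with a toy sketch. This is not the paper's algorithm: DivSkill-SQL optimizes new skills on the residual failures, whereas the sketch below assumes a fixed pool of skills whose solved examples are already known and simply picks, at each step, the one that covers the most still-unsolved examples. All skill names and data are hypothetical.

```python
from typing import Dict, List, Set


def pass_at_k(solved_by: Dict[str, Set[int]], ensemble: List[str], n_examples: int) -> float:
    """Fraction of examples solved by at least one member of the ensemble."""
    covered: Set[int] = set()
    for name in ensemble:
        covered |= solved_by[name]
    return len(covered) / n_examples


def build_ensemble(solved_by: Dict[str, Set[int]], n_examples: int, k: int) -> List[str]:
    """Greedily add the skill that solves the most still-unsolved examples."""
    ensemble: List[str] = []
    residual = set(range(n_examples))  # examples the current ensemble fails on
    for _ in range(k):
        candidates = [s for s in solved_by if s not in ensemble]
        if not candidates:
            break
        # Marginal contribution = newly covered examples, not raw accuracy.
        best = max(candidates, key=lambda s: len(solved_by[s] & residual))
        if not solved_by[best] & residual:
            break  # no remaining skill adds coverage
        ensemble.append(best)
        residual -= solved_by[best]
    return ensemble


# Hypothetical data: which of 10 examples each skill answers correctly.
solved_by = {
    "baseline": {0, 1, 2, 3, 4, 5},
    "baseline_resampled": {0, 1, 2, 3, 4, 6},  # fails on nearly the same examples
    "schema_checker": {6, 7, 8},
    "dialect_functions": {8, 9},
}

greedy = build_ensemble(solved_by, n_examples=10, k=2)
naive = ["baseline", "baseline_resampled"]  # the two most accurate skills

print(greedy, pass_at_k(solved_by, greedy, 10))
print(naive, pass_at_k(solved_by, naive, 10))
```

In this toy data the resampled baseline is as accurate as the original on its own but adds only one new example, so pairing the two gives Pass@2 of 0.7. The weaker but complementary schema-focused skill adds three new examples and lifts Pass@2 to 0.9.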
On the Spider2-Lite benchmark, DivSkill-SQL gains 11.1 points on Snowflake and 8.3 points on BigQuery over the strongest baseline, with consistent improvements across two base models (Opus-4.6 and GPT-5.4). Skills optimized on one dialect transfer without retraining to other SQL dialects and even to a different task formulation, BIRD-Critic (+2.6 points). Error analysis reveals up to 3x fewer hallucinated schema references and function calls, indicating that the gains come from genuinely complementary capabilities rather than surface-level variation. The result points toward more reliable text-to-SQL systems without the cost of fine-tuning, which is particularly useful for enterprise data querying across diverse database backends.
- DivSkill-SQL improves selected accuracy on Spider2-Lite by 11.1 points on Snowflake and 8.3 points on BigQuery over the strongest baseline.
- The framework builds complementary agentic ensembles without any model fine-tuning, directly targeting Pass@K marginal contributions.
- Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating genuine reliability gains.
Why It Matters
Smarter, cheaper SQL generation from natural language: no fine-tuning needed, fewer hallucinations, and skills that transfer across database dialects.