Open Source

I tested as many of the small local and OpenRouter models as I could with my own agentic text-to-SQL benchmark. Surprises ensued...

A new agentic text-to-SQL benchmark pits 25+ models against tough queries, with surprising winners.

Deep Dive

Developer Nick Lothian has launched a new, publicly accessible benchmark designed to stress-test AI models on a critical real-world task: agentic text-to-SQL generation. The benchmark presents models with 25 complex English queries—like calculating revenue per unit and margin percentages across product subcategories—and challenges them to produce executable SQL. Crucially, the AI agent can see query results and engage in limited debugging rounds to fix errors, mimicking a practical development workflow. The test is deliberately short and fast, running in under five minutes for most models, but is "tough enough to separate the best models from the others." A key innovation is the integration of a Llama.cpp WASM version, allowing anyone to run the benchmark against their own local server or via OpenRouter.
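The workflow described above — the model proposes SQL, the harness executes it, and any error is fed back for a limited number of repair rounds — can be sketched roughly as follows. This is an illustrative reconstruction, not the benchmark's actual harness: the `stub_model` function, schema, and question are invented for the demo.

```python
import sqlite3

def run_agentic_sql(ask_model, question, conn, max_rounds=3):
    """Minimal agentic text-to-SQL loop: the model proposes SQL, sees
    the execution error (if any), and gets a few rounds to fix it."""
    feedback = ""
    for _ in range(max_rounds):
        sql = ask_model(question, feedback)  # model returns a SQL string
        try:
            rows = conn.execute(sql).fetchall()
            return sql, rows  # executable query: success
        except sqlite3.Error as e:
            feedback = f"Query failed: {e}. Previous SQL: {sql}"
    return None, None  # model never produced runnable SQL

# Tiny demo with a stubbed "model" that corrects itself after feedback.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, revenue REAL, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("widget", 100.0, 4), ("gadget", 90.0, 3)])

def stub_model(question, feedback):
    # First attempt references a nonexistent column; the retry fixes it.
    if not feedback:
        return "SELECT product, revenue / qty FROM sales"
    return "SELECT product, revenue / units AS revenue_per_unit FROM sales"

sql, rows = run_agentic_sql(stub_model, "revenue per unit by product", conn)
print(rows)  # [('widget', 25.0), ('gadget', 30.0)]
```

A real run would replace `stub_model` with a call to the model under test; the point is that scoring rewards recovery from errors, not just first-shot accuracy.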

The results have yielded significant surprises, reshuffling expectations around model capabilities. Leading the pack among open models are Kimi-K2.5 and two Qwen variants: the massive 397B-A17B and the surprisingly efficient 27B-parameter model. In a notable upset, NVIDIA's Nemotron-Cascade-2-30B-A3B outscores the larger Qwen 3.5-35B-A3B and matches the performance of Codex 5.3. Lothian also highlights Mimo v2 Flash as an underrated "gem." The benchmark is now live for community validation, with Lothian actively soliciting scores and feedback for a planned v2, aiming to create a standard for evaluating practical, agentic reasoning in data tasks.

Key Points
  • The benchmark tests 25 complex text-to-SQL queries with agentic debugging capabilities, completing in under 5 minutes.
  • Top performers include open models Kimi-K2.5 and Qwen 3.5 (397B & 27B), with NVIDIA's Nemotron-30B beating larger rivals.
  • It's now a public tool with local WASM support via Llama.cpp, enabling community testing and model comparison.
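For local testing along the lines of the last point, llama.cpp's `llama-server` exposes an OpenAI-compatible chat endpoint, and OpenRouter speaks the same API with a different base URL plus an API key. A minimal request-building sketch (the model name, port, schema, and prompt are illustrative assumptions, not the benchmark's actual prompts):

```python
import json
from urllib import request

def sql_completion_request(base_url, model, question, schema):
    """Build an OpenAI-compatible chat request, as served by
    llama.cpp's llama-server (or OpenRouter with an auth header)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": f"Write SQLite SQL only. Schema:\n{schema}"},
            {"role": "user", "content": question},
        ],
        "temperature": 0,  # deterministic output suits benchmarking
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Point at a local llama-server (default port shown; adjust to yours).
req = sql_completion_request(
    "http://localhost:8080", "local-model",
    "Total revenue per product subcategory",
    "sales(product TEXT, subcategory TEXT, revenue REAL)")
# response = request.urlopen(req)  # uncomment with a server running
```

Swapping `base_url` (and adding an `Authorization` header) is all it takes to target OpenRouter instead of a local model.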

Why It Matters

Provides a fast, practical standard for evaluating AI on real database tasks, guiding developers toward the most effective open models for data workflows.