Fine-tuned Qwen3 SLMs (0.6-8B) beat frontier LLMs on narrow tasks
Tiny 0.6B-parameter model achieves 98.7% accuracy on function calling, rivaling models 100x its size at a fraction of the cost.
A new benchmark from Distil Labs reveals that small, specialized AI models can outperform giant general-purpose ones on specific tasks. The team fine-tuned open-source Qwen3 models (0.6 billion to 8 billion parameters) via distillation from open-weight teacher models, using as few as 50 training examples per task. Tested against frontier APIs like OpenAI's GPT-5, Anthropic's Claude 4.6, and Google's Gemini 2.5 across nine datasets, the results were striking. The tiny Qwen3-0.6B model, for instance, achieved 98.7% accuracy on a Smart Home function calling task, beating Gemini Flash's 92.0%. On Text2SQL, a Qwen3-4B model scored 98.0%, nearly matching Claude Haiku's 98.7% at a tiny fraction of the cost.
The cost and performance advantages are dramatic. Running on a single H100 GPU, the distilled Qwen3-4B model for Text2SQL costs approximately $3 per million requests, compared to $378 for Claude Haiku and $24 for GPT-5 nano. The same model also sustained 222 requests per second with low latency on that hardware. However, the study notes a clear trade-off: frontier models still dominate tasks requiring broad world knowledge and open-ended reasoning, like HotpotQA. The practical takeaway is that distillation is ideal for high-volume, structured tasks with well-defined schemas, while frontier APIs remain the better fit for low-volume or knowledge-intensive work. All code, models, and data from the benchmark are open-sourced.
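The distillation recipe described above (a larger teacher model generates labeled examples, and a small student is fine-tuned on them) can be sketched at the data-preparation step. This is a minimal illustration only, not Distil Labs' actual pipeline: the `teacher_answer` function is a mocked stand-in for a real teacher-model call, and the Text2SQL prompt and filename are hypothetical.

```python
import json

def teacher_answer(question: str) -> str:
    """Mock teacher output. A real pipeline would query an
    open-weight teacher LLM here and validate its response."""
    return f"SELECT count(*) FROM orders  -- for: {question}"

def build_sft_records(questions: list[str], system_prompt: str) -> list[dict]:
    """Format teacher outputs as chat-style records for
    supervised fine-tuning (one JSON object per example)."""
    return [
        {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": q},
                {"role": "assistant", "content": teacher_answer(q)},
            ]
        }
        for q in questions
    ]

if __name__ == "__main__":
    # Even a few dozen such records can seed a distillation run,
    # per the benchmark's "as few as 50 examples" finding.
    questions = ["How many orders were placed last week?"]
    records = build_sft_records(questions, "Translate the question into SQL.")
    with open("distill_sft.jsonl", "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```

The resulting JSONL file is in the chat-message format that common fine-tuning toolkits accept; the student model (e.g., a Qwen3-0.6B checkpoint) would then be trained on these records.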
- The 0.6B parameter Qwen3 model hit 98.7% accuracy on a Smart Home function calling task, outperforming larger frontier models.
- On Text2SQL, a distilled 4B model nearly matched Claude Haiku's performance (98.0% vs 98.7%) at ~1% of the cost (~$3 vs $378 per million requests).
- The benchmark shows distilled models match or beat the best mid-tier frontier API on 6 out of 9 narrow, structured tasks.
Why It Matters
This provides a clear, cost-effective blueprint for enterprises to deploy efficient, specialized AI agents for high-volume tasks without relying on expensive APIs.