Qwen3.6-35B-A3B beats Gemini 2.5 Pro on Terminal-Bench 2.0
Small open-source model outscores Gemini 2.5 Pro and the far larger Qwen3-Coder-480B on agentic benchmark.
The Qwen team's latest models, Qwen3.6-35B-A3B and Qwen3.5-9B, have entered the Terminal-Bench 2.0 leaderboard with results that challenge the assumption that agentic performance depends mainly on model size. Using the little-coder scaffold, the 35B-A3B variant scored 24.6% (±3.2), ahead of Google's Gemini 2.5 Pro (19.6%) and narrowly ahead of Alibaba's own Qwen3-Coder-480B (23.9%). This is a notable milestone for open-source models: a smaller model paired with an efficient scaffold can match or beat models that rely on far more parameters.
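To put the reported gaps in context, here is a minimal sketch of the arithmetic, using only the scores quoted above. It assumes the ±3.2 figure can be read as a symmetric band around 24.6; the article does not say whether it is a standard error or a confidence interval, and no margins are given for the other two models, so the `compare` helper is purely illustrative.

```python
# Scores reported in the article (percent of tasks solved).
QWEN_SCORE = 24.6
QWEN_MARGIN = 3.2  # assumption: treated as a symmetric +/- band around the score

rivals = {
    "Gemini 2.5 Pro": 19.6,
    "Qwen3-Coder-480B": 23.9,
}


def compare(score, margin, rival_score):
    """Return (gap, within_band).

    gap is the lead in percentage points, rounded to one decimal;
    within_band is True when the rival's point score lies inside score +/- margin.
    """
    gap = round(score - rival_score, 1)
    within_band = abs(score - rival_score) <= margin
    return gap, within_band


results = {name: compare(QWEN_SCORE, QWEN_MARGIN, s) for name, s in rivals.items()}

for name, (gap, within) in results.items():
    print(f"{name}: lead of {gap:+.1f} points, rival inside the band: {within}")
```

Under that reading, the 0.7-point lead over Qwen3-Coder-480B sits well inside the ±3.2 band, while the 5.0-point lead over Gemini 2.5 Pro falls outside it.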
The smaller Qwen3.5-9B model posted 9.2%, a sign that sub-10B-parameter models can now be meaningfully measured on hard agentic benchmarks like Terminal-Bench 2.0, where they would previously have been dismissed. The community-driven little-coder scaffold appears to narrow the gap between raw model capability and agentic task performance, even on challenging terminal-based evaluations. The results suggest that scaffold optimization, not just raw model size, is a key lever for practical AI agents. With this momentum, the open-source community is positioned to climb higher on the leaderboard while reducing compute requirements.
- Qwen3.6-35B-A3B + little-coder scaffold scores 24.6% on Terminal-Bench 2.0, beating Gemini 2.5 Pro (19.6%) and Qwen3-Coder-480B (23.9%).
- Sub-10B model Qwen3.5-9B scores 9.2%, suggesting small open-source models can now register meaningful results on hard agentic benchmarks.
- Results highlight scaffold-model co-design as a critical factor in agentic performance, not just parameter count.
Why It Matters
Open-source models paired with clever scaffolding can outscore much larger and proprietary models on agentic tasks, which could cut compute costs for real-world AI agents.