Open Source

Qwen3.6-35B-A3B tops Terminal-Bench 2.0, beats Gemini 2.5 Pro

r/LocalLLaMA May 16, 2026

⚡ Small open-source model outscores much larger rivals, including a 480B open model and a proprietary one, on an agentic benchmark.

Deep Dive

The Qwen team's latest models, Qwen3.6-35B-A3B and Qwen3.5-9B, have entered the Terminal-Bench 2.0 leaderboard with results that challenge common assumptions about what drives agentic AI performance. Using the little-coder scaffold, the 35B-A3B variant scored 24.6% (±3.2), clearly ahead of Google's Gemini 2.5 Pro (19.6%) and narrowly ahead of Alibaba's own, much larger Qwen3-Coder-480B (23.9%), a 0.7-point lead that falls within the reported margin. This marks a significant milestone for open-source models, which have often leaned on larger parameter counts rather than more efficient scaffolding.
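To put the reported leads in context, here is a minimal sketch of the arithmetic in Python. It uses only the scores quoted above. The helper name `compare_leads` is hypothetical, and the post does not say whether ±3.2 is a standard error or a confidence interval, nor does it give error bars for the rival models, so treat this as a rough reading rather than a significance test.

```python
# Scores in percent, as quoted in the post.
scores = {
    "Qwen3.6-35B-A3B + little-coder": 24.6,
    "Qwen3-Coder-480B": 23.9,
    "Gemini 2.5 Pro": 19.6,
}

# Margin reported for the Qwen3.6-35B-A3B run only (assumption: we reuse it
# as a crude yardstick, since no margins are given for the other models).
MARGIN = 3.2


def compare_leads(scores, leader, margin):
    """Map each rival to (lead in points, whether the lead exceeds the margin)."""
    top = scores[leader]
    return {
        name: (round(top - score, 1), (top - score) > margin)
        for name, score in scores.items()
        if name != leader
    }


leads = compare_leads(scores, "Qwen3.6-35B-A3B + little-coder", MARGIN)
for rival, (lead, exceeds) in leads.items():
    verdict = "outside" if exceeds else "within"
    print(f"vs {rival}: lead {lead:+.1f} pts, {verdict} the ±{MARGIN} margin")
```

Read this way, the gap to Gemini 2.5 Pro is larger than the stated margin, while the gap to Qwen3-Coder-480B is not.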

The smaller Qwen3.5-9B model posted 9.2%, a sign that sub-10B parameter models can now be meaningfully measured on hard agentic benchmarks like Terminal-Bench 2.0, where they would previously have been dismissed. The community-driven little-coder scaffold appears to narrow the gap between a model's raw capability and its agentic task performance, even on challenging terminal-based evaluations. The results suggest that scaffold optimization, not just raw model size, is a key lever for practical AI agents. The open-source community may now push higher on the leaderboard while aiming to reduce compute requirements and keep agentic performance competitive.

Key Points

Qwen3.6-35B-A3B + little-coder scaffold scores 24.6% on Terminal-Bench 2.0, beating Gemini 2.5 Pro (19.6%) and Qwen3-Coder-480B (23.9%).
Sub-10B model Qwen3.5-9B scores 9.2%, suggesting small open-source models can now register meaningful results on hard agentic benchmarks.
Results highlight scaffold-model co-design as a critical factor in agentic performance, not just parameter count.

Why It Matters

With a well-matched scaffold, a compact open-source model can outscore a proprietary system such as Gemini 2.5 Pro on agentic tasks, which could cut compute costs for real-world AI agents.
