Open Source

Qwen 3.6 35B crushes Gemma 4 26B on my tests

Alibaba's model fixed 32/37 bugs vs. 28 for Gemma, using 1.6x fewer tokens and running 1.74x faster.

Deep Dive

In a detailed benchmark of agentic coding capabilities, Alibaba's Qwen 3.6 35B model demonstrated clear superiority over Google's Gemma 4 26B. The test used a custom harness seeded with 37 intentional bugs, which each model had to debug in an agentic setup driven by OpenCode. Qwen fixed 32 issues (86.5%) with zero regressions, while Gemma fixed 28 but introduced 8 new bugs, yielding net scores (fixes minus regressions) of 32 vs. 20. Beyond accuracy, Qwen was markedly faster, completing the run in 49 minutes to Gemma's 85.
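The scoring arithmetic above can be sketched in a few lines. The figures come straight from the benchmark write-up; the exact scoring formula (fixes minus regressions) is inferred from the reported net scores:

```python
# Figures as reported: bugs fixed, regressions introduced, wall-clock minutes.
qwen = {"fixed": 32, "regressions": 0, "minutes": 49}
gemma = {"fixed": 28, "regressions": 8, "minutes": 85}

def net_score(model):
    # Net score = successful fixes minus newly introduced bugs.
    return model["fixed"] - model["regressions"]

print(net_score(qwen))   # 32
print(net_score(gemma))  # 20

# Speedup from the rounded minutes is ~1.73x; the write-up cites 1.74x,
# presumably computed from unrounded timings.
print(round(gemma["minutes"] / qwen["minutes"], 2))  # 1.73
```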

Qwen's advantage extended to token efficiency and tool usage. It consumed 1.6x fewer total tokens (674K vs. 1.1M), which works out to 2.6x better cost per net-score point. Both models achieved 100% tool-call success rates, with Qwen making 115 calls to Gemma's 96. Notably, Qwen maintained consistent prefill speeds throughout the run, while Gemma slowed as its context grew. These results cut against the perception of Qwen as overly verbose, showing it can be highly efficient in practical agentic environments.
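The two efficiency ratios follow directly from the reported totals; here is a minimal check (the token counts and net scores are from the article, the metric names are mine):

```python
# Reported totals: tokens consumed over the whole run, and net score
# (fixes minus regressions) for each model.
qwen_tokens, qwen_net = 674_000, 32
gemma_tokens, gemma_net = 1_100_000, 20

# Overall token ratio: Gemma used ~1.6x more tokens in total.
print(round(gemma_tokens / qwen_tokens, 1))  # 1.6

# Tokens spent per net-score point, i.e. cost-effectiveness.
qwen_cost = qwen_tokens / qwen_net      # ~21.1K tokens per point
gemma_cost = gemma_tokens / gemma_net   # 55K tokens per point
print(round(gemma_cost / qwen_cost, 1))  # 2.6
```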

Key Points
  • Qwen fixed 32/37 bugs (86.5%) with zero regressions vs. Gemma's 28 fixes with 8 regressions
  • Qwen completed the task in 49 minutes using 674K tokens, 1.74x faster and 1.6x more token-efficient than Gemma
  • Both models achieved 100% tool call success rates, with Qwen making 115 calls and maintaining consistent prefill speeds

Why It Matters

For developers using AI coding assistants, Qwen 3.6 offers faster, more accurate bug fixes with fewer errors and lower computational costs.