Open Source

Personal Eval Follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense

Dense models Qwen3.5 27B and Gemma4 31B score 100%, while MoE models lag behind.

Deep Dive

A detailed benchmark comparing Gemma4 26B MoE (Mixture of Experts) against the dense models Qwen3.5 27B and Gemma4 31B reveals a clear performance gap. The dense models both aced the test, achieving 100% fix rates on 37 baseline failures with zero regressions and zero guardrail violations. In contrast, the Gemma4 26B MoE model, even with an upgraded Q8 quant (Q8_K_XL), fixed only 45.9% of tests for 17 net fixes, worse than its Q4 quant, which reached a 75.7% fix rate (20 net fixes). The Qwen3.6 35B MoE performed better, fixing 86.5% for 32 net fixes.
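For reference, a quick sketch of how these headline numbers likely fit together, assuming "net fixes" means gross fixes minus regressions; the formula is inferred here, not stated in the post, and the gross fix counts are back-computed from the reported rates:

    BASELINE_FAILURES = 37

    def fix_rate(gross_fixes: int) -> float:
        """Share of the 37 baseline failures the model repaired."""
        return gross_fixes / BASELINE_FAILURES

    # Back-computed from the reported numbers (an inference, not source data):
    # Gemma4 26B MoE Q8_K_XL: 45.9% rate, 17 net -> 17 gross, 0 regressions
    # Gemma4 26B MoE Q4:      75.7% rate, 20 net -> 28 gross, 8 regressions
    # Qwen3.6 35B MoE:        86.5% rate, 32 net -> 32 gross, 0 regressions
    print(f"{fix_rate(17):.1%}, {fix_rate(28):.1%}, {fix_rate(32):.1%}")
    # -> 45.9%, 75.7%, 86.5%

This reading would also explain why the Q4 quant's 75.7% fix rate yields only 20 net fixes: its gains were partly offset by regressions.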

The efficiency metrics largely reinforce the dense models' edge. Qwen3.5 27B used only ~16K tokens per fix and completed its run in 145 minutes, while Gemma4 31B took 629 minutes and used ~32K tokens per fix. Among the MoE models, Gemma4 26B Q4 was the least efficient of all runs at ~39K tokens per fix, while Qwen3.6 35B used ~21K. Tool-call analysis shows the dense models made more reads and edits per fix, suggesting deeper engagement with the codebase. The results strongly suggest that, for coding tasks, dense models currently outperform MoE architectures of similar parameter counts.
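As a rough comparison, the totals below multiply the reported tokens-per-fix by the inferred gross fix counts from the sketch above; these totals and the tokens-per-minute figures are back-of-the-envelope estimates, not numbers from the post:

    # Reported "~" tokens-per-fix, paired with inferred gross fix counts.
    runs = {
        "Qwen3.5 27B dense": (16_000, 37, 145),   # runtime reported
        "Gemma4 31B dense":  (32_000, 37, 629),   # runtime reported
        "Gemma4 26B MoE Q4": (39_000, 28, None),  # runtime not reported here
        "Qwen3.6 35B MoE":   (21_000, 32, None),  # runtime not reported here
    }

    for name, (tok_per_fix, fixes, minutes) in runs.items():
        total = tok_per_fix * fixes  # rough total-token estimate
        speed = f", ~{total / minutes / 1000:.1f}K tok/min" if minutes else ""
        print(f"{name}: ~{total / 1e6:.2f}M tokens{speed}")
    # Qwen3.5 27B dense: ~0.59M tokens, ~4.1K tok/min
    # Gemma4 31B dense:  ~1.18M tokens, ~1.9K tok/min
    # Gemma4 26B MoE Q4: ~1.09M tokens
    # Qwen3.6 35B MoE:   ~0.67M tokens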

Key Points
  • Qwen3.5 27B dense and Gemma4 31B dense both achieved 100% fix rates on 37 baseline failures, with zero regressions.
  • Gemma4 26B MoE performed worst, fixing only 45.9% of tests at Q8 versus 75.7% at Q4, despite Q8 being the higher-precision quant.
  • Dense models were more token-efficient: Qwen3.5 27B used ~16K tokens per fix vs. ~39K for Gemma4 26B MoE Q4.

Why It Matters

For developers, dense models like Qwen3.5 27B and Gemma4 31B currently offer better reliability and efficiency on coding tasks than similarly sized MoE models.