Open Source

Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge

In a 30-question blind test judged by Claude Opus 4.6, Qwen 3.5 27B edged out Google's Gemma 4 models despite reliability issues.

Deep Dive

An independent researcher conducted a rigorous 30-question blind evaluation comparing three leading open-weight language models: Qwen 3.5 27B, Google's Gemma 4 31B, and Google's Mixture-of-Experts (MoE) variant Gemma 4 26B-A4B. Claude Opus 4.6 served as the sole judge, scoring each model's response on a 0-10 scale across five categories: code, reasoning, analysis, communication, and meta-alignment. The entire evaluation cost $4.50, and the judge's structured scoring output parsed successfully 99.9% of the time.
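
The write-up does not include the harness code, but the judging protocol it describes maps onto a short loop like the sketch below. The Anthropic SDK calls are real, while the model identifier, prompt wording, and JSON schema are illustrative assumptions rather than the researcher's actual setup.

```python
# Minimal sketch of the blind-judging loop described above. The model identifier,
# prompt wording, and JSON schema are assumptions, not the researcher's harness.
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are grading an anonymized answer to the question below.
Category: {category}

Question:
{question}

Answer (model identity hidden):
{answer}

Reply with JSON only: {{"score": <0-10 float>, "rationale": "<one sentence>"}}"""

def judge_response(category: str, question: str, answer: str) -> float | None:
    """Score one blinded answer 0-10; return None when the judge output fails to parse."""
    msg = client.messages.create(
        model="claude-opus-4-6",  # assumed identifier for Claude Opus 4.6
        max_tokens=200,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            category=category, question=question, answer=answer)}],
    )
    try:
        return float(json.loads(msg.content[0].text)["score"])
    except (json.JSONDecodeError, KeyError, ValueError, IndexError):
        return None  # failures here count against the reported 99.9% parse rate
```

Because the judge only ever sees an anonymized answer, per-question "wins" can then be decided by comparing the three models' scores on each question.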

Qwen 3.5 27B emerged with the most wins, taking 14 of 30 matchups (46.7%), but the victory came with a significant caveat: it received three 0.0 scores, which the evaluator attributed to format failures or refusals rather than poor answers. Excluding those failures, Qwen's adjusted average of ~9.08 was the highest of the trio, and it dominated reasoning (winning 5 of 6 questions) and analysis (4 of 6). Google's Gemma 4 31B won 12 matchups (40%) and excelled in communication, winning 5 of 6 questions. The MoE-based Gemma 4 26B-A4B won only 4 matchups (13.3%) and failed outright on two questions, though when it did work, its average score matched that of the dense 31B model.
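
To make the "adjusted average" concrete, the snippet below drops the zero scores before averaging. The individual numbers are invented placeholders, not the eval's data; only the adjustment itself comes from the write-up.

```python
# Illustration of the adjusted average: drop the 0.0 format failures/refusals,
# then average what remains. The scores below are hypothetical placeholders.
qwen_scores = [9.5, 8.5, 0.0, 9.0, 0.0, 9.5, 0.0, 8.8]  # invented subset of the 30

raw_avg = sum(qwen_scores) / len(qwen_scores)
non_failures = [s for s in qwen_scores if s > 0.0]       # strip the catastrophic zeros
adjusted_avg = sum(non_failures) / len(non_failures)

print(f"raw average:      {raw_avg:.2f}")                # dragged down by the zeros
print(f"adjusted average: {adjusted_avg:.2f}")           # ~9.1 with these placeholders
```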

The evaluation uncovered notable operational characteristics. Gemma 4 31B exhibited extremely long response times, with some generations taking up to five minutes, suggesting heavy internal chain-of-thought processing that didn't consistently correlate with higher scores. Qwen 3.5 27B was notably more verbose, generating 3-5x more tokens per response on average, incurring a 'verbosity tax' that the judge didn't consistently penalize or reward. The researcher acknowledged methodological limitations, including the small sample size, potential biases from using a single LLM judge, and the use of custom, non-standardized questions. However, the test provides a compelling snapshot of current model capabilities and failure modes in a controlled, blind setting.

Key Points
  • Qwen 3.5 27B won 46.7% of matchups but had a 10% catastrophic failure rate, scoring 0.0 on three questions.
  • When excluding failures, Qwen's adjusted average score of ~9.08 was highest; it dominated reasoning and analysis tasks.
  • Google's Gemma 4 31B won 40% of matchups and excelled in communication, while its MoE variant (26B-A4B) matched the 31B's average score on the questions it completed without errors.

Why It Matters

For developers choosing open models, the trade-off is clear: peak performance (Qwen) versus reliability and speed (Gemma).