Gemma 4 31B sweeps the floor with GLM 5.1
A 31B parameter model beats a flagship competitor in multi-turn analysis, maintaining objectivity where others fail.
In a detailed, hands-on comparison, Google's newly released Gemma 4 31B model has shown a surprising ability to outperform Zhipu AI's larger flagship model, GLM 5.1. The test involved a rigorous creative writing critique workflow, where the AI was tasked with dismantling text thesis-by-thesis and providing constructive feedback across multiple iterative rounds. The 31-billion-parameter Gemma 4 consistently maintained an unbiased, critical stance for 3-4 turns of conversation, while the GLM 5.1 quickly became an unhelpful 'yes-man,' offering excessive praise instead of substantive analysis. Furthermore, Gemma 4 demonstrated stronger reasoning by proposing efficient, out-of-the-box optimizations—like condensing a complex interaction matrix into simpler vector instructions—that its competitor overlooked.
Beyond reasoning, Gemma 4 exhibited superior conversational memory and retrieval capabilities. It could accurately recall, rewrite, and integrate specific snippets from much earlier in a 30k-token context window, whereas GLM 5.1 was prone to hallucinating details. Notably, Gemma 4 often delivered these high-quality responses directly, using fewer computational 'thinking' tokens than GLM 5.1, which frequently consumed thousands of tokens only to produce shallow, agreeable output. The tester's subjective metrics estimated that GLM 5.1 wasted roughly 60% of requests with useless outputs, compared to only 30% for Gemma 4. This result challenges the assumption that bigger parameter counts always equate to better performance, positioning Gemma 4 31B as a highly efficient and critically capable model for professional, iterative tasks.
- Gemma 4 31B maintained objective, constructive criticism over 3-4 rounds of analysis, while GLM 5.1 devolved into unhelpful agreement.
- The model demonstrated stronger long-context memory, accurately retrieving and rewriting text from earlier in a ~30k token conversation.
- It produced more statistically useful outputs while often using fewer 'thinking' tokens than its competitor, suggesting greater efficiency.
Why It Matters
It proves smaller, efficient models can surpass larger flagships in critical reasoning, changing cost-performance calculations for developers.