Gemma 4 is efficient with thinking tokens, but it will also happily reason for 10+ minutes if you prompt it to do so.
Gemma 4's 31B model reasoned for 594 seconds without hallucinating on a complex cypher task.
Google's latest open-source AI models, Gemma 4, have revealed a surprising capability for extended reasoning when explicitly prompted by users. In benchmark testing against a complex cypher challenge—previously only solved by top closed-source models and select open-source ones like Kimi 2.5 Thinking and Deepseek 3.2—the Gemma 4 26B MoE and 31B dense models were instructed to "spare no effort" and maximize thinking length. The result was a dramatic shift from initial short, hallucinated responses to sustained, careful analysis.
The 26B model reasoned continuously for ten minutes before hitting what appears to be a platform timeout in AI Studio, while the 31B dense model processed for 594 seconds before concluding it could not solve the puzzle. Crucially, neither model produced a false answer; both instead admitted the limits of their analysis. This behavior contrasts with models like Qwen, which tend to overthink by default, and suggests that Gemma 4's reasoning depth is directly controllable via natural-language prompting rather than hidden parameters.
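In practice, this kind of prompt-level control amounts to prepending a plain-language effort instruction to the task rather than adjusting sampler or thinking-budget parameters. A minimal sketch, where the helper function and effort levels are hypothetical illustrations (only the "spare no effort" phrasing comes from the test described above), not part of any Gemma API:

```python
# Hypothetical sketch: steering reasoning depth with natural language,
# not hidden parameters. The effort levels here are illustrative.
def build_prompt(task: str, effort: str = "maximum") -> str:
    """Prepend an effort instruction to a task prompt."""
    if effort == "maximum":
        # Mirrors the "spare no effort" instruction used in the cypher test.
        prefix = ("Spare no effort. Think for as long as you need, "
                  "and do not answer until you are confident. If you "
                  "cannot solve the task, say so rather than guessing.")
    else:
        prefix = "Answer concisely with minimal reasoning."
    return f"{prefix}\n\n{task}"


if __name__ == "__main__":
    print(build_prompt("Decode the following cypher: ..."))
```

The same task string can then be sent with either prefix to compare short, default-effort answers against extended deliberation.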
This finding raises intriguing questions about benchmark performance. While Gemma 4 currently trails similar-sized Qwen 3.5 models on standard benchmarks, its ability to engage in extended, deliberate reasoning when prompted could indicate untapped potential. If future testing confirms that performance scales with reasoning time, Gemma 4 could represent a significant advance in creating transparent, user-controllable AI systems that balance efficiency with depth on demand.
- Gemma 4's 31B dense model performed 594 seconds of continuous reasoning on a cypher task when prompted, avoiding hallucination.
- Unlike Qwen or closed models, whose reasoning depth typically requires parameter tweaks, Gemma 4's appears controllable via simple natural-language instructions.
- The finding suggests benchmark scores may not reflect full potential, as extended reasoning could close performance gaps with leading models.
Why It Matters
Shows open-source models can achieve deliberate, verifiable reasoning, offering professionals more transparent and controllable AI tools.