Smaller models are getting scary good.
A 31B open-weight model used Python scripts to prove a puzzle impossible, then peer-reviewed and debunked a frontier model's answer.
A viral social media experiment demonstrated a surprising reversal in AI capabilities, in which a smaller, open-weight model from Google outperformed a much larger frontier model at rigorous reasoning. The test presented both Google's Gemini 3 Deepthink (a massive mixture-of-experts model) and the 31-billion-parameter Gemma 4 with a complex, secretly unwinnable security puzzle. While Gemini 3 spent 15 minutes generating a highly structured but ultimately flawed answer, complete with a hallucinated equation to force a solution, Gemma 4 took a different approach: it used its tool access to run multiple Python scripts, rigorously checking the puzzle's constraints and proving the problem was physically impossible.
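The original puzzle itself wasn't published, but the verification strategy, exhaustively checking a problem's constraints in code instead of trusting a plausible-looking answer, can be sketched with a hypothetical stand-in puzzle. Every name and number below is invented for illustration; only the technique mirrors what Gemma 4 reportedly did:

```python
from itertools import product

# Hypothetical stand-in puzzle (the real one wasn't shared):
# distribute 25 access tokens across 3 vaults so that every vault
# holds exactly 10 tokens. A counting argument already shows this
# is infeasible (3 * 10 = 30 != 25), but a script can prove it
# without any cleverness by checking every possible assignment.
TOKENS = 25
VAULTS = 3
REQUIRED = 10

def feasible_assignments():
    # Enumerate every way to place 0..TOKENS tokens in each vault,
    # then keep only assignments satisfying both constraints.
    for split in product(range(TOKENS + 1), repeat=VAULTS):
        if sum(split) == TOKENS and all(v == REQUIRED for v in split):
            yield split

solutions = list(feasible_assignments())
print(f"solutions found: {len(solutions)}")  # prints "solutions found: 0"

# Exhaustive search and the direct counting argument agree:
assert VAULTS * REQUIRED != TOKENS
```

The point is not the toy puzzle but the discipline: an empty solution set from an exhaustive check is a proof of impossibility, which is exactly the kind of ground truth a confident-sounding but hallucinated equation cannot survive.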
In a follow-up act of 'agentic peer review,' the user then fed Gemini's incorrect solution to Gemma 4 for analysis. The smaller model systematically dismantled it, catching a hard physical-constraint violation, calling out the fatal logic flaw, and even critiquing Gemini for being 'blinded by the professionalism of the output.' When Gemma's arguments were fed back to Gemini 3 Deepthink, the larger model immediately folded, acknowledging that its internal verification had failed and its logic was broken.

This incident shows that raw parameter count isn't the sole determinant of effective reasoning, especially when models are equipped with tools for verification. That a 31B model not only solved a problem a larger model failed at, but also deconstructed and corrected the larger model's work, challenges assumptions about how intelligence scales.
- Gemma 4 (31B) used Python tool access to mathematically prove an 'unwinnable' puzzle was impossible, while Gemini 3 hallucinated an equation to force a solution.
- Gemma 4 then performed a peer-review, debunking Gemini's answer by catching a physical constraint violation and a fatal logic flaw.
- When confronted with Gemma's critique, Gemini 3 Deepthink conceded its logic was broken, showing smaller models can out-reason larger ones in specific tasks.
Why It Matters
This demonstrates that tool-augmented reasoning and verification can trump raw model size, with direct implications for how we evaluate and deploy AI for critical analysis.