Qwen 3.6 27B vs Gemma 4 31B - building a Pac-Man game!
Gemma built a fully working Pac-Man in 4 minutes vs Qwen's 18.
In a recent benchmark comparing two open-source language models on a real-world coding task, Google's Gemma 4 31B decisively outperformed Alibaba's Qwen 3.6 27B. The test: generate a complete, playable Pac-Man HTML game from a single prompt. Both models ran locally on a MacBook Pro M5 Max with 64GB RAM. Qwen produced a more verbose response with creative visual elements and a neon aesthetic, but took 18 minutes and 4 seconds and consumed 33,946 tokens. Gemma, despite generating at 27 tokens/sec (vs Qwen's 32), finished in just 3 minutes and 51 seconds using only 6,209 tokens. Crucially, Gemma's output was cleaner, more logically consistent, and fully functional out of the box. The game featured smooth tile-based movement, proper ghost pathfinding, wall collision, and working power pellets—meeting all the contest specifications. Qwen's version, while visually impressive, had gameplay bugs and took significantly longer to produce.
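The core mechanics the prompt asked for can be sketched in a few lines. This is a generic illustration of tile-based movement with wall collision, not code from either model's output; the maze layout and names are invented for the example.

```javascript
// Illustrative maze: '#' is a wall, '.' is an open tile.
const maze = [
  "#####",
  "#...#",
  "#.#.#",
  "#...#",
  "#####",
];

const isWall = (x, y) => maze[y][x] === "#";

// Move an entity one tile in (dx, dy); blocked moves leave it in place.
function step(entity, dx, dy) {
  const nx = entity.x + dx;
  const ny = entity.y + dy;
  if (!isWall(nx, ny)) {
    entity.x = nx;
    entity.y = ny;
  }
  return entity;
}

const pacman = { x: 1, y: 1 };
step(pacman, 1, 0);  // open corridor: moves to (2, 1)
step(pacman, 0, -1); // wall above: stays at (2, 1)
```

Snapping every entity to grid tiles like this is what makes collision checks trivial, and it is the kind of logical consistency the benchmark rewarded.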
The contest underscores a key insight for developers deploying LLMs for code generation: raw token throughput is secondary to the quality and conciseness of the final answer. Gemma 4 31B demonstrated that a model can produce high-quality code with fewer iterations and less token spend, directly impacting cost and latency in production use. For tasks requiring precise logic—like game development, API integration, or data processing—a model that understands the problem and delivers a compact, correct solution is more valuable than one that merely generates large amounts of text. The results also highlight the importance of evaluating LLMs on task-specific benchmarks rather than generic perplexity or speed metrics. As local models improve, this kind of real-world coding test becomes a critical differentiator for enterprise adoption.
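The throughput point is simple arithmetic: wall-clock generation time is roughly tokens produced divided by tokens per second, so a slower but more concise model can still finish far sooner. Using the article's figures (the small gap versus the reported times is prompt processing and rounding):

```javascript
// Estimated generation time from the article's reported numbers.
const estimateSeconds = (tokens, tokensPerSec) => tokens / tokensPerSec;

const gemmaSec = estimateSeconds(6209, 27);  // ≈ 230 s, close to the reported 3m51s
const qwenSec  = estimateSeconds(33946, 32); // ≈ 1061 s, close to the reported 18m04s

console.log((qwenSec / gemmaSec).toFixed(1)); // Gemma finishes ~4.6x faster
```

Despite a ~19% lower token rate, Gemma's answer was roughly a fifth the length, which dominates both latency and per-token cost.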
- Gemma 4 31B completed the Pac-Man game in 3m51s with 6,209 tokens vs Qwen's 18m04s and 33,946 tokens.
- Despite lower tokens/sec (27 vs 32), Gemma's output had superior game logic: smooth tile-based movement, correct wall collision, and working ghost behavior.
- The contest demonstrates that efficiency and concise, correct code can be more valuable than verbose creativity in real-world coding tasks.
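The "proper ghost pathfinding" credited to Gemma is commonly implemented as breadth-first search over the tile grid, which yields a shortest path in an unweighted maze. The sketch below shows that standard technique; it is an assumption about how such a game would be built, not either model's actual code.

```javascript
// BFS from the ghost's tile to the target tile; returns the first
// step the ghost should take, or its current tile if no path exists.
// The maze is an array of strings where '#' marks a wall.
function nextGhostStep(maze, start, target) {
  const key = (x, y) => `${x},${y}`;
  const queue = [[start.x, start.y]];
  const cameFrom = new Map([[key(start.x, start.y), null]]);

  while (queue.length) {
    const [x, y] = queue.shift();
    if (x === target.x && y === target.y) {
      // Walk the parent chain back to the first move out of the start tile.
      let cur = key(x, y);
      let prev = cameFrom.get(cur);
      while (prev && prev !== key(start.x, start.y)) {
        cur = prev;
        prev = cameFrom.get(cur);
      }
      const [nx, ny] = cur.split(",").map(Number);
      return { x: nx, y: ny };
    }
    for (const [dx, dy] of [[1, 0], [-1, 0], [0, 1], [0, -1]]) {
      const nx = x + dx, ny = y + dy;
      if (maze[ny]?.[nx] && maze[ny][nx] !== "#" && !cameFrom.has(key(nx, ny))) {
        cameFrom.set(key(nx, ny), key(x, y));
        queue.push([nx, ny]);
      }
    }
  }
  return { x: start.x, y: start.y }; // unreachable target: stay put
}
```

Recomputing this each tick (or only when the target changes tiles) is cheap on a Pac-Man-sized grid, and getting it right is exactly the kind of "precise logic" the article argues these benchmarks should measure.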
Why It Matters
For developers, Gemma shows that smaller, efficient models can outperform larger ones in practical coding tasks.