Gemma 4 is good
The new 27-billion-parameter model runs at ~1000 tokens/sec on an M1 Ultra while delivering more coherent chain-of-thought reasoning than Alibaba's larger Qwen3.5 35B.
Early performance tests from the AI community suggest Google's Gemma 4 27B model, specifically the 'a4b' iteration, is a strong contender in the open-weight LLM space. Running quantized (Q4_K_XL) via llama.cpp on an Apple M1 Ultra, the model processes approximately 1000 tokens per second with a 20,000-token context window, matching the inference speed of Alibaba's larger Qwen3.5 35B model. More significantly, users report its reasoning quality surpasses Qwen3.5's: Gemma 4 produces more concise, coherent, and helpful chain-of-thought outputs without the 'inner-gaslighting' or looping observed in its competitor. The model also shows promising visual understanding and multilingual performance.
However, the release comes with technical caveats. The model's Key-Value (KV) cache, the memory structure that stores attention states, is notably large, requiring about 22GB of VRAM for a full 260K-token context in FP16, plus ~18GB for the quantized model weights. While not as memory-optimized as Qwen3.5 or NVIDIA's Nemotron, community techniques like Sliding Window Attention (SWA) and upcoming methods like TurboQuant may mitigate this. A notable warning: the official Google AI Studio version reportedly underperforms locally run GGUF files, and the model's safety filters are described as overly restrictive, particularly for medical queries, though custom prompting may help.
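That footprint follows from straightforward arithmetic. The sketch below uses the standard per-token KV-cache formula (2 tensors × layers × KV heads × head dimension × bytes per element) and shows how a sliding-window layout caps the cost of window-limited layers; the layer counts, head counts, and window size in the example are illustrative assumptions, not the model's published configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Total KV-cache size: one K and one V tensor per layer, per token.
    bytes_per_elem=2 corresponds to FP16."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

def swa_kv_cache_bytes(n_global_layers, n_local_layers, window,
                       n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """With sliding-window attention, local layers cache at most `window` tokens;
    only global layers grow with the full context length."""
    full = kv_cache_bytes(n_global_layers, n_kv_heads, head_dim,
                          n_tokens, bytes_per_elem)
    local = kv_cache_bytes(n_local_layers, n_kv_heads, head_dim,
                           min(window, n_tokens), bytes_per_elem)
    return full + local

# Illustrative (assumed) hyperparameters for a 27B-class model:
GiB = 1024 ** 3
full = kv_cache_bytes(n_layers=46, n_kv_heads=16, head_dim=128, n_tokens=260_000)
swa = swa_kv_cache_bytes(n_global_layers=8, n_local_layers=38, window=4096,
                         n_kv_heads=16, head_dim=128, n_tokens=260_000)
print(f"full attention: {full / GiB:.1f} GiB, with SWA: {swa / GiB:.1f} GiB")
```

The takeaway: layers restricted to a sliding window contribute a fixed-size cache regardless of context length, which is why SWA-style layouts shrink long-context memory so dramatically.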
- Matches Qwen3.5 35B speed (~1000 tok/sec) on M1 Ultra while being a smaller 27B model, indicating better efficiency.
- Demonstrates superior reasoning with more coherent and helpful chain-of-thought, avoiding the looping issues seen in competitors.
- Has a large KV cache memory footprint (~22GB VRAM for a full 260K-token context in FP16).
- Shows strong visual understanding and multilingual performance.
Why It Matters
This positions Gemma 4 as a highly efficient, reasoning-strong open model for local deployment, challenging larger models on performance-per-parameter.