Open Source

Gemma 4 26B A3B is mind-blowingly good, if configured right

A developer found Gemma 4 runs flawlessly at a 260K-token context at 80-110 tokens/sec, outperforming Qwen and Perplexity for agentic coding.

Deep Dive

Google's Gemma 4 26B model is emerging as a powerhouse for local AI development, particularly when optimized. A developer's extensive testing on an RTX 3090 revealed that with the right configuration—specifically Unsloth's Q3_K_M quantization combined with flash attention and a Q4-quantized KV cache—the model can maintain blistering speeds of 80-110 tokens per second. Crucially, it supports context windows up to 260,000 tokens without performance degradation, a feat that other popular local models like Qwen 3.5 MoE struggled with due to prompt-caching bugs on certain systems.
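As a rough sketch, a comparable setup on llama.cpp's server (the engine that Ollama and LM Studio wrap) might look like the following. The GGUF filename is an assumption, not the developer's verbatim configuration; the flags are standard llama.cpp options:

```shell
# Hypothetical launch approximating the described setup.
# -c sets the context window, -fa enables flash attention, and the q4_0
# cache types give a Q4-quantized KV cache so a ~260K-token context can
# fit alongside the Q3_K_M weights in an RTX 3090's 24 GB of VRAM.
llama-server \
  -m gemma-4-26b-a3b-Q3_K_M.gguf \
  -c 262144 \
  -fa \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -ngl 99
```

KV-cache quantization matters here because at long contexts the cache, not the weights, dominates memory use; dropping it from 16-bit to Q4 roughly quarters that footprint.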

In practical testing, Gemma 4 demonstrated exceptional capability for agentic coding and tool calling. Over a six-hour session with the 2.7GB Open Code repository, the model flawlessly navigated and explained the complex codebase, a task where it reportedly "cannot fail." The user compared its reasoning quality to Anthropic's Claude Sonnet, noting that, when connected to a search plugin in LM Studio, it outperformed services like Perplexity. While VRAM-heavy (requiring ~24GB for full context with tool calling), Gemma 4's robust support in Ollama and LM Studio makes it a uniquely reliable and powerful option for developers building local AI coding assistants.
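For tool calling, LM Studio exposes an OpenAI-compatible endpoint (port 1234 by default), so an agentic client just sends a standard chat-completions request with a tools array. A minimal sketch follows; the model identifier and the `search_code` tool are illustrative assumptions, not part of the developer's setup:

```shell
# Illustrative tool-calling request to LM Studio's OpenAI-compatible API.
# The model name and the search_code tool schema are assumptions; the
# model replies with a tool_calls entry when it decides to use the tool.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a3b",
    "messages": [
      {"role": "user", "content": "Where is the CLI entry point defined?"}
    ],
    "tools": [{
      "type": "function",
      "function": {
        "name": "search_code",
        "description": "Search the repository for a string",
        "parameters": {
          "type": "object",
          "properties": {"query": {"type": "string"}},
          "required": ["query"]
        }
      }
    }]
  }'
```

An agent loop then executes whatever tool call comes back, appends the result as a `tool` message, and re-sends the conversation until the model produces a final answer.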

Key Points
  • Achieves 80-110 tokens/sec at 260K context on an RTX 3090 with Q4 KV-cache quantization and flash attention
  • Outperforms Qwen 3.5 MoE on reliability, avoiding the tool-calling infinite loops and Windows/LM Studio caching bugs that plagued that model
  • Demonstrated Claude Sonnet-level quality analyzing a 2.7GB codebase and excelling at agentic workflows and search

Why It Matters

Gives developers a free, locally runnable AI coding agent that rivals premium cloud models in quality and context handling.