Open Source

Gemma4 26B A4B Apex quant hits 38 tps at 90k context on 16GB VRAM

A user report suggests mudler's APEX quantization can run a 26B model at 90k context on a 16GB GPU with no noticeable quality loss.

Deep Dive

A Reddit user reported that mudler's APEX quantization of Google's Gemma-4 26B A4B model performs well on consumer hardware. Running on an AMD RX 9060 XT 16GB GPU with the llama.cpp Vulkan backend, they measured 38 tokens per second at a 90,000-token context window, with no looping, memory errors, or noticeable quality degradation. The APEX-I-Compact variant used 15GB of VRAM, leaving roughly 1GB of headroom on a 16GB card.

For comparison, the same user had previously tried an unsloth UD-Q5KXL quant of the same model, which required 21.2GB and began looping (repeating its output) at just 50k context. The comparison suggests that the APEX quant may offer better memory efficiency and stability for long-context inference, though it is one user's result on one hardware setup. The user does not claim universal superiority, only that it is a worthwhile experiment for anyone running large language models locally. If the result holds up more broadly, it could make long-context applications more practical on mid-range GPUs.
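The memory figures above can be checked with a few lines of arithmetic. The following is a minimal sketch that takes the reported numbers at face value; the function names are hypothetical, and it assumes the 15GB and 21.2GB figures are directly comparable totals, which the post does not confirm.

```python
# Figures as reported in the post; treated as directly comparable (an assumption).
CARD_VRAM_GB = 16.0


def headroom_gb(used_gb: float, card_gb: float = CARD_VRAM_GB) -> float:
    """Free VRAM in GB; a negative value means the workload exceeds the card."""
    return round(card_gb - used_gb, 1)


def seconds_to_generate(n_tokens: int, tokens_per_second: float) -> float:
    """Time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second


apex_headroom = headroom_gb(15.0)      # APEX-I-Compact
unsloth_headroom = headroom_gb(21.2)   # unsloth UD-Q5KXL

print(f"APEX headroom:    {apex_headroom:+.1f} GB")
print(f"unsloth headroom: {unsloth_headroom:+.1f} GB")
print(f"1,000 tokens at 38 tok/s: {seconds_to_generate(1000, 38.0):.1f} s")
```

On these numbers the APEX quant leaves about 1GB free, while the 21.2GB quant exceeds the card's VRAM by about 5GB, so it could not have been held entirely in VRAM on this GPU.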

Key Points
  • Achieved 38 tokens per second at 90,000 context on a 16GB RX 9060 XT with llama.cpp Vulkan using mudler's APEX quant.
  • APEX-I-Compact variant used 15GB of VRAM with no looping and no noticeable quality degradation.
  • A previously tested unsloth UD-Q5KXL quant (21.2GB) began looping at 50k context, suggesting APEX may be more memory-efficient and stable at long context.

Why It Matters

If the report generalizes, a 26B model can run at 90k context on an affordable 16GB GPU, putting long-context inference within reach of mid-range hardware.