Developer Tools

b9080

New release enables local inference of Gemma4 26B with NVFP4 quantization

Deep Dive

llama.cpp's b9080 release adds support for the Gemma4_26B_A4B_NVFP4 model, including conversion of the Hugging Face checkpoint to GGUF format. Pre-built binaries cover macOS (Apple Silicon, Intel, iOS), Linux (x64, arm64, s390x with various backends), Windows (CPU, CUDA, Vulkan, SYCL, HIP), Android, and openEuler.

Key Points
  • Added support for Gemma4_26B_A4B_NVFP4 model with GGUF format conversion
  • NVIDIA's NVFP4 quantization enables 4-bit floating-point inference, cutting weight memory roughly 4x compared with FP16
  • Pre-built binaries for macOS, Linux, Windows, Android, and iOS across CPU and GPU backends
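The memory claim above is straightforward back-of-envelope arithmetic. A minimal sketch, assuming NVFP4 stores about 4 bits per weight plus small per-block scale overhead (~4.5 bits/param is a common estimate for block-scaled 4-bit formats; the exact GGUF file overhead varies):

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight-only memory footprint in GB.

    Ignores KV cache, activations, and runtime buffers; bits_per_param
    for NVFP4 is an assumption (~4.5 incl. block scales), not a measured value.
    """
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp16 = weight_memory_gb(26, 16)    # ~52 GB at FP16
nvfp4 = weight_memory_gb(26, 4.5)  # ~14.6 GB at ~4.5 bits/param

print(f"FP16: {fp16:.1f} GB, NVFP4: {nvfp4:.1f} GB, "
      f"reduction: {fp16 / nvfp4:.1f}x")
```

This is why a 26B model that would need a workstation-class GPU at FP16 can fit on a high-end consumer machine at 4-bit precision.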

Why It Matters

Puts a 26B-parameter model within reach of local devices using NVIDIA's efficient FP4 quantization, bringing advanced AI inference to consumer hardware.