b9080
New release enables local inference of Gemma4 26B with NVFP4 quantization
Deep Dive
llama.cpp's b9080 release adds support for the Gemma4_26B_A4B_NVFP4 model, including conversion of the Hugging Face checkpoint to GGUF format. Builds are available for macOS (Apple Silicon, Intel, iOS), Linux (x64, arm64, s390x with various backends), Windows (CPU, CUDA, Vulkan, SYCL, HIP), Android, and openEuler.
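For readers converting the checkpoint themselves, a minimal sketch of the usual two-step flow follows. The `convert_hf_to_gguf.py` script and the `llama-quantize` tool ship with llama.cpp, but the local paths and the "NVFP4" type name passed to `llama-quantize` are assumptions here; check `llama-quantize --help` in the b9080 build for the exact identifier.

```python
# Sketch: convert a Hugging Face checkpoint to GGUF, then requantize.
# convert_hf_to_gguf.py and llama-quantize ship with llama.cpp; the
# "NVFP4" type string below is an assumption -- verify it against
# `llama-quantize --help` in the b9080 build.
import subprocess

HF_DIR = "models/Gemma4_26B_A4B"          # local HF checkpoint (assumed path)
F16_GGUF = "models/gemma4-26b-f16.gguf"   # lossless intermediate GGUF
NVFP4_GGUF = "models/gemma4-26b-nvfp4.gguf"

# Step 1: HF safetensors -> GGUF at f16 (convert_hf_to_gguf.py lives in
# the llama.cpp source tree).
subprocess.run(
    ["python", "convert_hf_to_gguf.py", HF_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Step 2: requantize the f16 GGUF to NVFP4.
subprocess.run(
    ["./llama-quantize", F16_GGUF, NVFP4_GGUF, "NVFP4"],
    check=True,
)
```

Going through an f16 intermediate keeps the quantizer's input lossless, so the only precision loss is the 4-bit quantization step itself.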
Key Points
- Added support for the Gemma4_26B_A4B_NVFP4 model, with conversion to GGUF format
- NVIDIA's NVFP4 quantization enables 4-bit floating-point inference, cutting weight memory roughly 4x versus FP16 (see the sketch after this list)
- Pre-built binaries for macOS, Linux, Windows, Android, and iOS across CPU and GPU backends
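To make the ~4x figure concrete, the back-of-envelope sketch below uses NVFP4's published layout: 4-bit E2M1 values in blocks of 16, each block sharing one FP8 (E4M3) scale, plus a per-tensor FP32 scale. These layout details come from NVIDIA's NVFP4 description, not from the llama.cpp source, and the parameter count is taken at face value from the model name.

```python
# Back-of-envelope weight memory for a 26B-parameter model, FP16 vs NVFP4.
# NVFP4 layout (per NVIDIA's description): 4-bit E2M1 values in blocks of
# 16, one FP8 (E4M3) scale per block; the per-tensor FP32 scale is
# negligible and ignored here.
PARAMS = 26e9
BLOCK = 16  # elements sharing one FP8 scale

fp16_bytes = PARAMS * 2
# 4 bits per value + 8 bits of scale amortized over each 16-value block
nvfp4_bits_per_value = 4 + 8 / BLOCK       # = 4.5 bits/value
nvfp4_bytes = PARAMS * nvfp4_bits_per_value / 8

print(f"FP16 : {fp16_bytes / 2**30:6.1f} GiB")    # ~48.4 GiB
print(f"NVFP4: {nvfp4_bytes / 2**30:6.1f} GiB")   # ~13.6 GiB
print(f"ratio: {fp16_bytes / nvfp4_bytes:.2f}x")  # ~3.6x, i.e. roughly 4x
```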
Why It Matters
Puts a 26B-parameter model on local consumer devices using NVIDIA's efficient FP4 quantization, democratizing advanced AI inference.
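As a usage illustration, loading the quantized file through the llama-cpp-python bindings might look like the sketch below. The model path is the hypothetical output of the conversion sketch above, and loading NVFP4 tensors assumes the bindings are built against llama.cpp b9080 or later.

```python
# Minimal local-inference sketch using the llama-cpp-python bindings
# (pip install llama-cpp-python). The GGUF path is the hypothetical
# output of the conversion sketch above; NVFP4 support assumes the
# bindings are built against llama.cpp b9080 or later.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gemma4-26b-nvfp4.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm("Explain NVFP4 quantization in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```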