llama.cpp b9297 adds NVFP4 MTP support for Qwen3.5 models
New release adds NVFP4 (NVIDIA 4-bit floating-point) scale tensors for multi-token prediction (MTP) and links MTP tensors for Qwen3.5 models.
Deep Dive
llama.cpp (ggml-org) released b9297, which adds NVFP4 MTP scale tensors, links Qwen3.5 MTP tensors, and includes a change listed as "aligned nullptr". The release ships builds for macOS, Linux, Windows, Android, and openEuler, covering CPU, Vulkan, CUDA, ROCm, SYCL, and other backends.
Key Points
- Adds NVFP4 MTP scale tensors for improved multi-token prediction with NVIDIA FP4 quantization
- Links Qwen3.5 MTP tensors, expanding model compatibility
- Provides pre-built binaries for macOS, Linux, Windows, Android, and openEuler with multiple backends (CPU, Vulkan, CUDA, ROCm/HIP, SYCL)
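To make the role of the scale tensors concrete, here is a minimal sketch of block-scaled 4-bit floating-point quantization in the style of NVFP4. It is an illustration under stated assumptions, not llama.cpp's implementation: the function names are hypothetical, and the per-block scale is kept as an ordinary Python float, whereas the actual format stores scales in a compact 8-bit floating-point encoding.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 groups elements into blocks of 16 that share one scale


def quantize_block(block):
    """Hypothetical helper: return (scale, codes) for one block of floats.

    Each code is a (is_negative, index) pair, where index selects an entry
    of E2M1_VALUES. The scale maps the block's largest magnitude onto 6.0,
    the largest E2M1 value.
    """
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # assumption: plain float, not FP8
    codes = []
    for x in block:
        target = abs(x) / scale
        # Pick the nearest representable magnitude.
        idx = min(range(len(E2M1_VALUES)),
                  key=lambda i: abs(E2M1_VALUES[i] - target))
        codes.append((x < 0, idx))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from the 4-bit codes and the block scale."""
    return [(-1.0 if neg else 1.0) * E2M1_VALUES[idx] * scale
            for neg, idx in codes]
```

The point of the sketch is that the 4-bit codes alone are not enough to recover the weights: each block also needs its scale, which is why a model's MTP tensors need matching scale tensors once they are stored in NVFP4.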
Why It Matters
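People running Qwen3.5 models locally can now combine NVFP4 quantization with multi-token prediction. In principle this pairs the smaller memory footprint of 4-bit weights with MTP's faster decoding, which may lower the hardware needed for usable inference speeds.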
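Multi-token prediction typically speeds up inference by letting the model draft several tokens ahead and then checking them in a single pass, keeping only the tokens the main model agrees with. The following is a simplified sketch of that greedy accept step, assuming token IDs are plain integers; the function name is hypothetical and this is not llama.cpp's code.

```python
def accept_draft(draft_tokens, verified_tokens):
    """Hypothetical helper: greedy verification of drafted tokens.

    draft_tokens    -- tokens proposed by the MTP head(s)
    verified_tokens -- tokens the main model picks at the same positions
    Drafted tokens are kept while they match; at the first mismatch the
    main model's token is taken instead and the rest of the draft is dropped.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            accepted.append(verified)  # fall back to the main model's choice
            break
        accepted.append(drafted)
    return accepted
```

When most drafted tokens are accepted, several tokens are produced per verification pass instead of one, which is where the speedup comes from.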