llama.cpp b9297 adds NVFP4 MTP support for Qwen3.5 models

llama.cpp Releases May 24, 2026

⚡ New release adds 4-bit floating-point (NVFP4) scale tensors for multi-token prediction (MTP) and extends MTP tensor support to Qwen3.5 models.

Deep Dive

llama.cpp (ggml-org) released b9297, which adds NVFP4 MTP scale tensors, links Qwen3.5 MTP tensors, and includes a change listed as "aligned nullptr". The release ships builds for macOS, Linux, Windows, Android, and openEuler with CPU, Vulkan, CUDA, ROCm, SYCL, and other backends.

Key Points

Adds NVFP4 MTP scale tensors for improved multi-token prediction with NVIDIA FP4 quantization
Links Qwen3.5 MTP tensors, expanding model compatibility
Provides pre-built binaries for macOS, Linux, Windows, Android, and openEuler with multiple backends (Vulkan, CUDA, ROCm, SYCL, HIP)
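
To make the "scale tensors" in the first key point more concrete, here is a minimal Python sketch of block-scaled 4-bit floating-point quantization. It assumes the commonly described NVFP4 layout: 4-bit E2M1 values grouped in blocks of 16, with one scale per block. The function names are hypothetical, the scale is kept as a plain Python float rather than an 8-bit value, and none of this reflects llama.cpp's actual kernels or tensor layout.

```python
# Illustrative sketch only: not llama.cpp's implementation or on-disk format.

# Magnitudes representable by a 4-bit E2M1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: one scale per 16 elements


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 select the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def quantize_block(values):
    """Choose a scale so the largest magnitude maps to 6.0, then round each
    value to the nearest representable E2M1 magnitude."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        codes.append(index | (0b1000 if v < 0 else 0))
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate values: decoded 4-bit element times the block scale."""
    return [decode_e2m1(c) * scale for c in codes]


def bits_per_weight(scale_bits: int = 8) -> float:
    """Storage cost per weight: 4 bits plus the block scale amortized over the block."""
    return 4 + scale_bits / BLOCK_SIZE


if __name__ == "__main__":
    block = [6.0, -3.0, 0.5] + [0.0] * 13
    codes, scale = quantize_block(block)
    print(dequantize_block(codes, scale))
    print(bits_per_weight())
```

The scale is what lets sixteen coarse 4-bit values cover the actual range of a block of weights, which is why a quantized tensor needs its companion scale tensor to be loaded alongside it.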

Why It Matters

People running LLMs locally can now use NVFP4-quantized MTP tensors with Qwen3.5 models. Multi-token prediction can speed up inference, and 4-bit quantization lowers the memory needed to run a model.
