Open Source

New llama.cpp Docker images simplify running MTP models on GPUs

Run multi-token prediction locally with CUDA, Vulkan, Intel, or ROCm Docker images

Deep Dive

A developer (Havenoammo) built Docker images for llama.cpp that run multi-token prediction (MTP) models, with tags for the CUDA 12/13, Vulkan, Intel, and ROCm backends. Unsloth released MTP GGUF models for Qwen 3.6 (27B and 35B-A3B). Havenoammo's own MTP models keep the MTP layers at Q8 quantization, while Unsloth quantizes some MTP layers at lower precision. The Docker images serve as a drop-in replacement until official llama.cpp builds support MTP. Users can choose between higher-precision MTP layers (more VRAM) and more aggressively quantized ones (less VRAM).
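
The precision-versus-VRAM trade-off can be roughed out with simple arithmetic. The sketch below is illustrative only: the 400-million-weight count for the MTP layers is a hypothetical placeholder, and Q4_0 is an assumed stand-in for "lower" quantization, since the formats Unsloth actually uses are not stated here. The bits-per-weight values follow from the llama.cpp block layouts for Q8_0 and Q4_0 (32 weights plus one 16-bit scale per block). Under those assumptions the gap comes out in the same range as the roughly 200 MB figure reported for this release.

```python
# Bits per weight for two llama.cpp block formats. Each block stores 32 weights
# plus one 16-bit scale: Q8_0 = (32*8 + 16) / 32, Q4_0 = (32*4 + 16) / 32.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_0": 4.5}


def tensor_megabytes(n_params: int, quant: str) -> float:
    """Approximate storage for n_params weights in the given format, in MB (10^6 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e6


def extra_megabytes(n_params: int, high: str = "Q8_0", low: str = "Q4_0") -> float:
    """Extra memory needed to keep the same tensors at `high` instead of `low`."""
    return tensor_megabytes(n_params, high) - tensor_megabytes(n_params, low)


# HYPOTHETICAL: 400 million weights in the MTP layers; not a published figure.
mtp_params = 400_000_000

print(f"Q8_0:  {tensor_megabytes(mtp_params, 'Q8_0'):.0f} MB")
print(f"Q4_0:  {tensor_megabytes(mtp_params, 'Q4_0'):.0f} MB")
print(f"extra: {extra_megabytes(mtp_params):.0f} MB")
```

Only the MTP layers are counted; the main model weights, KV cache, and compute buffers are the same in both cases and are left out of the estimate.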

Key Points
  • Docker images support CUDA 12/13, Vulkan, Intel, and ROCm for broad GPU compatibility
  • Unsloth released MTP GGUF models for Qwen 3.6 (27B and 35B-A3B), with some MTP layers quantized below Q8
  • Havenoammo's Q8 MTP layers offer higher precision but use ~200 MB more VRAM than Unsloth's quantized versions

Why It Matters

The images bring multi-token prediction to a wide range of GPU setups, enabling faster local inference through speculative decoding.