New llama.cpp Docker images simplify running MTP models on GPUs
Run multi-token prediction locally with CUDA, Vulkan, Intel, or ROCm Docker images
A developer, Havenoammo, built Docker images for llama.cpp that run multi-token prediction (MTP) models, with tags for the CUDA 12, CUDA 13, Vulkan, Intel, and ROCm backends. Separately, Unsloth released MTP GGUF models for Qwen 3.6 (27B and 35B-A3B). Havenoammo's own MTP models keep the MTP layers at Q8, while Unsloth quantizes some MTP layers to lower precision. The Docker images serve as a drop-in replacement until official llama.cpp builds support MTP. Users can choose between higher-precision MTP layers, which need more VRAM, and more aggressively quantized ones, which need less.
- Docker images support CUDA 12/13, Vulkan, Intel, and ROCm for broad GPU compatibility
- Unsloth released MTP GGUF models for Qwen 3.6 (27B and 35B-A3B), with some MTP layers quantized below Q8
- Havenoammo's Q8 MTP layers offer higher precision but use ~200 MB more VRAM than Unsloth's more heavily quantized versions
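As a rough illustration of what switching backends looks like in practice, the sketch below assembles a `docker run` command for each backend. The image name `example/llama.cpp-mtp`, the tag spellings, and the model filename are placeholders, since the summary does not give them. It also assumes the image's entrypoint is `llama-server`, as in the official llama.cpp server images, which may not hold for these builds. The device options are the usual Docker ways of exposing a GPU on Linux and may need adjusting for a given host.

```python
import shlex

# Placeholder image reference: the repository name and exact tag spellings
# are NOT given in the source and are assumptions made for illustration.
IMAGE = "example/llama.cpp-mtp"

# Usual Docker options for exposing a GPU to a container on Linux.
DEVICE_FLAGS = {
    "cuda12": ["--gpus", "all"],                                  # NVIDIA Container Toolkit
    "cuda13": ["--gpus", "all"],
    "vulkan": ["--device", "/dev/dri"],                           # DRM render nodes
    "intel": ["--device", "/dev/dri"],
    "rocm": ["--device", "/dev/kfd", "--device", "/dev/dri"],     # AMD compute + render
}


def docker_command(backend, model_dir, model_file, port=8080):
    """Build a `docker run` argument list for the given backend tag."""
    if backend not in DEVICE_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    return [
        "docker", "run", "--rm",
        *DEVICE_FLAGS[backend],
        "-v", f"{model_dir}:/models",          # mount the local GGUF directory
        "-p", f"{port}:{port}",                # publish the server port
        f"{IMAGE}:{backend}",
        # Arguments below assume the entrypoint is llama-server.
        "-m", f"/models/{model_file}",
        "--host", "0.0.0.0",
        "--port", str(port),
        "-ngl", "99",                          # offload all layers to the GPU
    ]


if __name__ == "__main__":
    for backend in DEVICE_FLAGS:
        print(shlex.join(docker_command(backend, "/data/models", "model.gguf")))
```

Only the device flags and the image tag change between backends. The model mount and server arguments stay the same, which is what makes the images usable as a drop-in replacement.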
Why It Matters
The images bring multi-token prediction to a wide range of GPU setups, enabling faster local inference through speculative decoding.
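To make the speed-up concrete, here is a toy sketch of one round of greedy speculative decoding, assuming the MTP layers play the role of the cheap draft predictor. It is a generic illustration, not llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the full model and the draft, and a real engine verifies all drafted tokens in a single batched forward pass rather than one call per token.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """Run one round of greedy speculative decoding.

    target_next / draft_next: functions mapping a token list to the next token.
    Returns the tokens produced this round (between 1 and k + 1).
    """
    # 1. The draft predictor proposes k tokens autoregressively.
    ctx = list(context)
    draft = []
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2. The target model checks each proposal in order.
    ctx = list(context)
    accepted = []
    for token in draft:
        expected = target_next(ctx)
        if expected != token:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched, so the target adds one more token.
    accepted.append(target_next(ctx))
    return accepted


if __name__ == "__main__":
    # Toy models over digits: the target counts upward modulo 10,
    # and the draft agrees except right after a 3.
    target = lambda ctx: (ctx[-1] + 1) % 10
    draft = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

    print(speculative_step(target, draft, [1]))  # [2, 3, 4]
    print(speculative_step(target, draft, [5]))  # [6, 7, 8, 9, 0]
```

The output always matches what the target model alone would have generated. The gain comes from accepting several tokens per target pass whenever the draft guesses correctly.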