AtomicBot boosts Qwen 3.6 on LLaMA.cpp with Multi-Token Prediction, 40% faster inference
MacBook Pro M5 Max runs Qwen 3.6 27B at 34 tokens/s with 90% acceptance rate
AtomicBot has open-sourced an optimization for running large language models locally: Multi-Token Prediction (MTP) integrated into LLaMA.cpp alongside its TurboQuant quantization. The technique, implemented for the Qwen 3.6 family (27B and 35B parameter variants), predicts several future tokens per inference step rather than generating one at a time. Combined with TurboQuant's aggressive quantization, with weights shipped in GGUF format, the system raises token generation from 21 tokens/s to 34 tokens/s on a MacBook Pro M5 Max with 64GB of RAM. The gain is reported as a 40% speedup; measured by the figures themselves, throughput rises about 62%, which corresponds to roughly 38% less time per token. The speculative predictions have a reported 90% acceptance rate, meaning most predicted tokens are kept and the model rarely has to fall back on one-token-at-a-time decoding, the usual bottleneck for autoregressive generation.
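To make the accept-or-fall-back idea concrete, here is a toy sketch of draft-then-verify decoding. It is not AtomicBot's implementation and does not use any LLaMA.cpp API: `target_next` and `draft_next` are hypothetical stand-ins for the full model and a cheap multi-token predictor, and the sketch uses a simplified variant in which each verification pass yields between 1 and `k` tokens.

```python
from typing import List, Tuple

def target_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the full model: a deterministic next-token rule."""
    return (tokens[-1] * 31 + len(tokens)) % 101

def draft_next(tokens: List[int]) -> int:
    """Hypothetical cheap predictor: matches the target except when the
    context length is a multiple of 10, where it guesses wrong on purpose."""
    guess = target_next(tokens)
    return (guess + 1) % 101 if len(tokens) % 10 == 0 else guess

def sequential_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them; returns (tokens, verification passes)."""
    tokens = list(prompt)
    goal = len(tokens) + n_new
    passes = 0
    while len(tokens) < goal:
        # 1. Draft k tokens cheaply, each conditioned on the previous drafts.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify. A real runtime would score all k positions in one batched
        #    forward pass; this toy simulates that pass position by position.
        passes += 1
        accepted: List[int] = []
        for d in draft:
            correct = target_next(tokens + accepted)
            accepted.append(correct)   # the target's token is always the one kept
            if d != correct:
                break                  # first mismatch: discard the rest of the draft
        tokens.extend(accepted)
    return tokens[:goal], passes

spec_tokens, n_passes = speculative_decode([1], 50)
print(spec_tokens == sequential_decode([1], 50), n_passes)
```

Because every kept token is the one the target itself would have produced, the output matches plain sequential decoding exactly; the only thing that changes is how many verification passes are needed. A high acceptance rate means most passes keep all `k` drafted tokens.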
The project provides two key assets for the local AI community: a patched fork of LLaMA.cpp with MTP and TurboQuant support on GitHub, and pre-quantized Qwen 3.6 models (27B and 35B) on HuggingFace. Developers and enthusiasts can download the optimized runtime and model weights directly, bypassing the typical need to fine-tune or implement speculative decoding from scratch. For anyone running large models on consumer-grade hardware, especially Apple Silicon laptops, this represents a practical path to interactive-speed inference without cloud APIs. The repository also links to Atomic.Chat, an app for running AI models locally, suggesting a broader ecosystem push toward self-hosted LLM deployment.
- Multi-Token Prediction (MTP) raises generation speed from 21 to 34 tokens/s on a MacBook Pro M5 Max 64GB, reported as a 40% speedup (about 62% higher throughput, or roughly 38% less time per token).
- Achieves a 90% acceptance rate for predicted tokens, so rejected predictions and the wasted work they represent are rare.
- Pre-quantized Qwen 3.6 models (27B and 35B) in GGUF format are available on HuggingFace, with patched LLaMA.cpp on GitHub.
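The throughput figures in the list above can be checked with a few lines of arithmetic. This sketch uses only the two reported numbers (21 and 34 tokens/s); the variable names are illustrative, and it shows why the same measurement can be described either as a throughput gain or as a per-token time saving.

```python
baseline_tps = 21.0   # reported tokens/s without MTP
mtp_tps = 34.0        # reported tokens/s with MTP

# Relative increase in tokens generated per second.
throughput_gain = mtp_tps / baseline_tps - 1

# Relative reduction in time spent per token (time per token = 1 / tokens per second).
time_saved = 1 - baseline_tps / mtp_tps

print(f"throughput gain: {throughput_gain:.1%}")      # 61.9%
print(f"time per token reduced: {time_saved:.1%}")    # 38.2%
```

The time-per-token reduction, about 38%, is the figure closest to the reported 40%.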
Why It Matters
Enables high-speed local inference of large models on consumer hardware, reducing cloud dependency for developers.