Multi-Token Prediction (MTP) for LLaMA.cpp - 40% speedup for Gemma 4
New MTP technique pushes Gemma 26B to 138 tokens/s on a MacBook Pro
Reddit user gladkos has integrated Multi-Token Prediction (MTP) into LLaMA.cpp, delivering a roughly 40% performance boost for Gemma 4 assistant models quantized to GGUF format. In tests on a MacBook Pro with an M5Max chip, the 26B-parameter model generated Python Fibonacci code at 97 tokens/s without MTP and 138 tokens/s with it, an increase of about 42%. MTP works by predicting multiple future tokens in a single forward pass, reducing inference latency without sacrificing output quality.
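To make the idea concrete, here is a minimal, self-contained Python sketch of an MTP-style decode loop. It is illustrative only: `ToyModel`, `propose`, and `verify` are hypothetical stand-ins, not the API of the actual llama.cpp patch.

```python
import random

class ToyModel:
    """Stand-in model: real MTP heads would emit k future tokens per pass."""

    def propose(self, context, k):
        # One forward pass proposes the next k tokens (dummy values here).
        return [random.randrange(100) for _ in range(k)]

    def verify(self, context, proposals):
        # The base next-token head re-checks the drafts; keeping only the
        # longest agreeing prefix preserves single-token output quality.
        keep = random.randint(1, len(proposals))  # always accept >= 1 token
        return proposals[:keep]

def mtp_decode(model, prompt_ids, max_new_tokens, k=4):
    """Decode loop that accepts up to k tokens per forward pass."""
    tokens, generated = list(prompt_ids), 0
    while generated < max_new_tokens:
        accepted = model.verify(tokens, model.propose(tokens, k))
        accepted = accepted[: max_new_tokens - generated]  # don't overshoot
        tokens.extend(accepted)
        generated += len(accepted)
    return tokens[len(prompt_ids):]

print(mtp_decode(ToyModel(), [1, 2, 3], max_new_tokens=16))
```

The speedup comes from amortization: each accepted draft token beyond the first saves one full forward pass, which is why acceptance rate, not just k, determines the real-world gain.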
The implementation is open-source: the patched llama.cpp is on GitHub (AtomicBot-ai/atomic-llama-cpp-turboquant), and pre-quantized Gemma 4 GGUF models are available on Hugging Face. For developers running large language models locally, this means near-real-time responsiveness, especially on Apple Silicon hardware. The work builds on Google's Gemma 4 architecture and the community-driven LLaMA.cpp project, making state-of-the-art AI more accessible for on-device applications.
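For reference, loading a GGUF model locally with the stock llama-cpp-python bindings looks like the sketch below. The model filename is a placeholder; the MTP patch itself lives in the linked C++ fork and is not exposed through these standard bindings.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder filename: substitute the actual Gemma GGUF file
# downloaded from Hugging Face.
llm = Llama(model_path="gemma-4-26b-q4_k_m.gguf", n_ctx=4096)

out = llm("Write a Python function that returns the nth Fibonacci number.",
          max_tokens=256)
print(out["choices"][0]["text"])
```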
- Multi-Token Prediction (MTP) yields roughly 40% faster token generation (97 → 138 tok/s) for Gemma 26B on an M5Max MacBook Pro
- Implemented in LLaMA.cpp with quantized GGUF models available on Hugging Face
- Open-source patched llama.cpp released on GitHub for the community
Why It Matters
Up to 40% faster local inference brings responsive, private AI assistants to consumer hardware.