MTPLX | 2.24x faster TPS | The native MTP inference engine for Apple Silicon
Boosts Qwen 3.6 27B from 28 to 63 tok/s on a MacBook Pro M5 Max with exact temperature sampling.
MTPLX (Multi-Token Prediction Inference Engine) is a new open-source tool that dramatically speeds up large language model inference on Apple Silicon by exploiting the MTP (multi-token prediction) heads already present in models like Qwen 3.6 27B. Instead of requiring a separate drafter model, MTPLX uses the model's own built-in MTP heads as speculative drafters, achieving up to a 2.24x speedup with zero extra memory overhead. Testing on a MacBook Pro M5 Max with Qwen 3.6 27B (4-bit MLX) showed a jump from 28 to 63 tokens per second at temperature 0.6 with top_p 0.95 and top_k 20, the exact sampling settings Qwen recommends for coding. The engine uses mathematically exact probability-ratio rejection sampling with residual correction, so output quality is preserved even at higher temperatures, making it suitable for creative writing and chat, not just greedy decoding.
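The acceptance rule behind that exactness is the standard speculative-decoding one: keep a drafted token with probability min(1, p/q), where q is the MTP head's proposal distribution and p is the full model's, and on rejection resample from the renormalized residual max(0, p − q). Below is a minimal NumPy sketch of that rule; it is illustrative only and does not reflect MTPLX's actual internals.

```python
# Minimal sketch of probability-ratio rejection sampling with residual
# correction. Names and shapes are illustrative, not MTPLX's actual API.
import numpy as np

def accept_or_resample(draft_token, q_probs, p_probs, rng):
    """Accept a drafted token with probability min(1, p/q); otherwise
    resample from the residual distribution max(0, p - q), renormalized.
    q_probs: drafter (MTP head) distribution over the vocabulary.
    p_probs: verifier (full model) distribution over the vocabulary.
    The combined procedure samples exactly from p."""
    ratio = p_probs[draft_token] / max(q_probs[draft_token], 1e-12)
    if rng.random() < min(1.0, ratio):
        return draft_token, True             # token kept, output matches p exactly
    residual = np.clip(p_probs - q_probs, 0.0, None)
    residual /= residual.sum()               # renormalize the leftover mass
    return rng.choice(len(p_probs), p=residual), False

# Toy usage over a 5-token vocabulary (both distributions are assumed to
# already include temperature / top-p / top-k adjustments, e.g. temp 0.6).
rng = np.random.default_rng(0)
q = np.array([0.50, 0.20, 0.15, 0.10, 0.05])   # drafter (MTP head)
p = np.array([0.35, 0.30, 0.20, 0.10, 0.05])   # verifier (full model)
token, accepted = accept_or_resample(0, q, p, rng)
print(token, accepted)
```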
MTPLX stands apart from existing speculative decoding projects like DFlash and DDTree, which are restricted to greedy (temperature 0) sampling and rely on external drafter models that consume additional memory and must be created for each new model. By contrast, MTPLX supports adjustable temperatures for any task and works automatically on any model that ships MTP heads (depth configurable 2–7+). The engine is built on a patched MLX fork with custom Metal kernels, compiled verification graphs, and innovation-tape GDN rollback. It also includes a full CLI with a wizard, model download/management, MTP compatibility detection, an OpenAI/Anthropic-compatible API server, browser and terminal chat, a benchmarking suite, health diagnostics, and crash-safe fan control. For developers running local LLMs on Apple Silicon, MTPLX offers a practical way to get real-time performance for coding assistants and creative tools without sacrificing output fidelity.
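Since the server is described as OpenAI-compatible, a standard OpenAI client should be able to talk to it. The base URL, port, and model identifier in this sketch are assumptions for illustration, not documented MTPLX defaults.

```python
# Hypothetical usage of the OpenAI-compatible endpoint; base_url, port,
# and the model id below are placeholders, not documented MTPLX values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3.6-27b-4bit",                      # placeholder model id
    messages=[{"role": "user", "content": "Write a quicksort in Swift."}],
    temperature=0.6,                               # Qwen-recommended coding settings
    top_p=0.95,
)
print(response.choices[0].message.content)
```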
- 2.24x speed increase on Qwen 3.6 27B with exact temperature sampling (28 → 63 tok/s at temp 0.6)
- Uses the model's own built-in MTP heads as drafters, requiring no external drafter model and no extra memory (see the sketch after this list)
- Supports exact temperature sampling via probability-ratio rejection sampling, unlike greedy-only speculative decoding projects such as DFlash/DDTree
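For intuition, here is a toy sketch of how one draft-and-verify step could be structured: the MTP heads propose several future tokens, the full model scores them in a single batched pass, and the probability-ratio test keeps the longest acceptable prefix. All model pieces below are stand-ins (toy_drafter, toy_verifier), not MTPLX's real interfaces.

```python
# Toy draft-and-verify step; model stand-ins and the depth parameter are
# illustrative only, not MTPLX's actual interfaces.
import numpy as np

VOCAB = 8
rng = np.random.default_rng(0)

def normalize(x):
    return x / x.sum(axis=-1, keepdims=True)

def toy_drafter(context, depth):
    """Stand-in for the model's built-in MTP heads: cheap proposal
    distributions q_1..q_depth plus the tokens drafted from them."""
    r = np.random.default_rng(len(context) + 1)
    q = normalize(r.random((depth, VOCAB)))
    return q.argmax(axis=1), q

def toy_verifier(context, depth):
    """Stand-in for one batched forward pass of the full model that
    scores every drafted position at once, giving p_1..p_depth."""
    r = np.random.default_rng(len(context))
    return normalize(r.random((depth, VOCAB)))

def accept_or_resample(tok, q, p):
    """Probability-ratio acceptance with residual correction
    (same rule as the earlier sketch)."""
    if rng.random() < min(1.0, p[tok] / max(q[tok], 1e-12)):
        return tok, True
    residual = normalize(np.clip(p - q, 0.0, None))
    return rng.choice(VOCAB, p=residual), False

def speculative_step(context, depth=4):
    drafts, q_dists = toy_drafter(context, depth)   # MTP heads propose `depth` tokens
    p_dists = toy_verifier(context, depth)          # full model verifies them in one pass
    accepted = []
    for tok, q, p in zip(drafts, q_dists, p_dists):
        tok, ok = accept_or_resample(int(tok), q, p)
        accepted.append(int(tok))
        if not ok:   # first rejection ends the step; remaining drafts are discarded
            break
    return accepted

print(speculative_step([1, 2, 3]))
```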
Why It Matters
Enables real-time, high-quality LLM inference on Apple Silicon for coding and creative work without output degradation.