More Qwen3.6-27B MTP success, this time on dual MI50s
Runs at 60 tok/s on five-year-old GPUs using multi-token prediction and tensor parallelism.
A developer successfully ran the 27-billion-parameter Qwen3.6-27B model on dual AMD Radeon Instinct MI50 GPUs by leveraging Multi-Token Prediction (MTP) alongside tensor parallelism. Using a custom fork of llama.cpp with ROCm 7.2 on CachyOS, they reported dramatic speedups: stock inference averaged 26 tokens per second (tok/s), while MTP alone increased throughput to roughly 39–41 tok/s across various tasks. Enabling tensor parallelism further boosted stock performance to 34–35 tok/s, and combining both techniques yielded 56–60 tok/s—a roughly 2x improvement over baseline. The benchmarks included coding, summarization, math, and creative writing, with a 78% aggregate draft acceptance rate.
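The reported numbers can be sanity-checked with a standard back-of-envelope model of speculative/multi-token decoding. This is a sketch, not the fork's actual logic: it assumes each successive drafted token is accepted independently with probability p (the article's 78% aggregate acceptance rate), and that k tokens are drafted per verification step; the function name and the choice of k values are illustrative.

```python
# Back-of-envelope model of MTP throughput (assumption: each successive
# drafted token is accepted independently with probability p; the real
# acceptance process is not independent, so treat this as an upper bound).

def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Each step always yields at least one token (the verified/corrected
    one); drafted token i is kept only if all earlier drafts in the step
    were kept, giving the geometric sum
    1 + p + p^2 + ... + p^k = (1 - p**(k+1)) / (1 - p).
    """
    return (1 - p ** (k + 1)) / (1 - p)

p = 0.78  # aggregate draft acceptance rate from the benchmarks
for k in (1, 2, 4):
    print(f"k={k}: ~{expected_tokens_per_step(p, k):.2f} tokens/step")
```

With p = 0.78 this predicts roughly 1.8 tokens per step at k = 1 and 2.4 at k = 2. The observed MTP-only gain (~26 to ~40 tok/s, about 1.5x) is lower than this idealized ceiling, which is expected: drafting and verification carry real compute and memory overhead.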
The approach involved grafting MTP support onto existing GGUF quantizations (Q4_1) and using a specialized llama.cpp fork that supports the technique on older compute architectures (gfx906). The developer noted that MTP works best when the model's draft head accurately predicts several future tokens, while tensor parallelism splits each layer's work across both MI50s. This is significant because the MI50 (based on the Vega 20 architecture) lacks the dedicated matrix-math units of newer accelerators, yet still achieved competitive inference speeds. For professionals running AI on budget or legacy hardware, this demonstrates that multi-token prediction and parallelism can breathe new life into older GPUs, reducing the need for expensive upgrades.
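The tensor-parallel half of the setup is easy to illustrate in miniature. The sketch below is not the fork's implementation, just the core idea: a layer's weight matrix is split column-wise across two devices, each device computes its shard of the matmul, and the partial outputs are gathered. (Upstream llama.cpp exposes a related tensor split across visible GPUs via its `--split-mode row` option; how the fork combines this with MTP is not detailed in the article.)

```python
import numpy as np

# Minimal column-parallel matmul sketch (two simulated "GPUs").
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))      # activations for one token
W = rng.standard_normal((512, 1024))   # full layer weight matrix

# Split the output columns across the two devices.
W0, W1 = np.hsplit(W, 2)
y0 = x @ W0                            # computed on device 0
y1 = x @ W1                            # computed on device 1
y = np.concatenate([y0, y1], axis=1)   # gather the shards

# The sharded result matches the unsharded matmul exactly.
assert np.allclose(y, x @ W)
```

Because each device holds half the weights and does half the FLOPs, the scheme trades a per-layer gather (communication over PCIe between the two MI50s) for roughly halved per-device compute and memory, which is why the stock model also sped up (26 to 34-35 tok/s) under tensor parallelism alone.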
- Stock Qwen3.6-27B runs at ~26 tok/s on dual MI50s; MTP alone boosts to ~40 tok/s.
- Combining MTP with tensor parallelism achieves up to 60 tok/s, a 2x speedup.
- 78% draft acceptance rate shows MTP efficiently predicts multiple tokens per step.
Why It Matters
Makes large language models practical on older AMD GPUs, lowering hardware costs for developers.