Qwen3.6 35B MoE on RTX 5080: 56 tok/s at 128k context, and MTP hurts more than it helps
Multi-Token Prediction slows down MoE at coding-agent context lengths, benchmark shows.
A benchmark run on an RTX 5080 16GB (Ryzen 9 9950X, 128GB RAM, llama.cpp b9204) tested three configurations of Qwen3.6: 27B IQ3, 35B Q4_K_XL, and 35B Q8_0, each with and without MTP (Multi-Token Prediction, recently merged into llama.cpp at b9190).

The surprising result: for the 35B MoE model, MTP made inference 23% slower at full 128k context. The reason is that MTP requires reserving ~1.5 GB of VRAM for a compute buffer (via --fit-target 1536), which pushes about 3 more MoE expert layers from GPU to CPU. Since CPU-bound expert layers are the main bottleneck for MoE inference, MTP's ~79% token acceptance rate cannot compensate for the slower per-step speed.

The optimal config is 35B Q4_K_XL with no MTP and --fit-target 1536, achieving 56 tok/s generation and 1,584 tok/s prompt processing at 131k tokens (the full 128k context). For users with 12GB cards, or those who can accept a 56k context window, the 27B IQ3 model fits entirely on GPU and benefits from MTP (73 tok/s).

The benchmark also reports that at 128k context, the MTP and non-MTP runs converge to the same token generation speed (~56 tok/s), which the author takes as confirmation that MTP offers no advantage at long context for models that don't fit entirely on GPU.
- Without MTP, Qwen3.6 35B Q4_K_XL runs at 56 tok/s generation and 1,584 tok/s prompt processing at 128k context on RTX 5080 16GB.
- MTP forces a ~1.5 GB VRAM buffer, pushing about 3 MoE expert layers to CPU and causing a 23% speed drop despite ~79% token acceptance.
- For the fully GPU-resident 27B IQ3 model, MTP boosts speed from ~56 to 73 tok/s. Rule of thumb: MTP helps only if the model fits entirely on GPU.
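A back-of-envelope model may help show why ~79% acceptance is not enough on its own. The sketch below is illustrative only and rests on assumptions the benchmark does not state: MTP drafts a single extra token per decoding step, each draft is accepted independently at the reported rate, and net throughput is simply (tokens per step) x (steps per second). The function names are hypothetical and are not part of llama.cpp.

```python
def tokens_per_step(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per decoding step.

    One token is always produced; draft token i only counts if all
    earlier drafts were accepted too (simplifying assumption).
    """
    return 1.0 + sum(acceptance_rate ** i for i in range(1, draft_tokens + 1))


def net_speedup(acceptance_rate: float, step_rate_ratio: float,
                draft_tokens: int = 1) -> float:
    """Throughput with MTP relative to without.

    step_rate_ratio = (steps/s with MTP) / (steps/s without MTP).
    """
    return tokens_per_step(acceptance_rate, draft_tokens) * step_rate_ratio


def break_even_step_ratio(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Step-rate ratio below which MTP becomes a net loss."""
    return 1.0 / tokens_per_step(acceptance_rate, draft_tokens)


def implied_step_ratio(observed_speedup: float, acceptance_rate: float,
                       draft_tokens: int = 1) -> float:
    """Step-rate ratio implied by an observed net speedup."""
    return observed_speedup / tokens_per_step(acceptance_rate, draft_tokens)


if __name__ == "__main__":
    acceptance = 0.79        # reported acceptance rate
    observed = 1.0 - 0.23    # reported 23% slowdown -> 0.77x throughput

    print(f"tokens per step:      {tokens_per_step(acceptance):.2f}")
    print(f"break-even step rate: {break_even_step_ratio(acceptance):.2f}x")
    print(f"implied step rate:    {implied_step_ratio(observed, acceptance):.2f}x")
```

Under these assumptions, each step yields about 1.79 tokens, so MTP still wins as long as steps run at roughly 0.56x the non-MTP rate or faster. A 23% net slowdown would imply steps running at only about 0.43x, which is consistent with the explanation that moving expert layers to the CPU dominates the cost.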
Why It Matters
Anyone running coding agents on long-context MoE models that do not fit entirely in VRAM should disable MTP to maximize inference speed on consumer GPUs.