Qwen 3.6 MTP model boosts local inference speed by 1.5x with 300K context
A Reddit user pushes a 35B MTP model to 300K context on a single 32GB Radeon GPU.
In a recent Reddit post, user Jorlen shared promising results from testing the new Qwen3.6-35B-A3B-UD-Q5_K_S model with Multi-token Prediction (MTP), a technique in which the model predicts several tokens ahead at each step instead of one, which speeds up inference. Running on an Asus Radeon R9700 AI Pro (32GB, RDNA 4) GPU under Ubuntu 24.04 with Vulkan, they used a Docker container (havenoammo/llama:vulkan-server) to run the MTP prototype of llama.cpp. The model, a 35B mixture-of-experts (MoE), delivered roughly 1.5x the tokens per second of standard (non-MTP) models, making it a strong candidate for local LLM workflows. The Q5_K_S quantization kept memory usage manageable while maintaining quality.
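The post does not describe how the llama.cpp MTP prototype works internally, so the following is only a generic draft-and-verify sketch of why predicting several tokens ahead can cut the number of full model passes. Both `target_next` and `draft_next` are hypothetical toy stand-ins, not real model or llama.cpp calls.

```python
from typing import List, Tuple

def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the full model: a deterministic next-token rule."""
    return (ctx[-1] * 3 + len(ctx)) % 11

def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for a cheap multi-token head: usually agrees with the target."""
    guess = target_next(ctx)
    # Deliberately wrong at some positions so that rejection is exercised.
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess

def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one full-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def draft_and_verify(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Draft k tokens, then keep the longest prefix the target agrees with.

    `passes` counts verification rounds. A real engine would score all drafted
    positions in one batched forward pass; this toy calls target_next per
    position but counts the round once to mirror that.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        passes += 1
        accepted: List[int] = []
        for tok in draft:
            correct = target_next(out + accepted)
            accepted.append(correct)
            if tok != correct:
                break  # first mismatch: keep the target's token, drop the rest
        else:
            # Every draft matched, so the same pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal], passes

tokens, passes = draft_and_verify([1, 2], n_new=24)
print(f"{passes} verification passes for 24 tokens")
```

The output is identical to plain greedy decoding; only the number of full passes changes. How much that helps in practice depends on how often the drafted tokens are accepted, which this toy does not model.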
Jorlen pushed the context window to 300K tokens with the KV cache quantized to Q8_0, consuming 28.3GB of the card's 32GB of VRAM and leaving headroom for up to 400K. The test involved building a pygame dungeon game step by step, simulating realistic project work. At deep context (around 200K tokens), the MoE model became unstable, prompting a switch to the Qwen 3.6 27B non-MoE version for further tests. Despite this, the results underscore MTP’s potential: faster generation and larger context windows on consumer-grade hardware, a significant leap from just a year ago. The local LLM community continues to push boundaries, enabling powerful models to run affordably at home.
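To see why KV cache quantization matters at these context lengths, here is a rough back-of-the-envelope estimator. The one firm number is the Q8_0 storage cost (blocks of 32 int8 values plus one fp16 scale, so 34 bytes per 32 values). The layer count, KV head count, and head dimension in the example call are hypothetical placeholders, not the real Qwen3.6 configuration, and the formula assumes standard attention where every counted layer stores keys and values for every token.

```python
Q8_0_BYTES_PER_VALUE = 34 / 32  # 32 int8 values + one fp16 scale per block
F16_BYTES_PER_VALUE = 2.0

def kv_cache_gib(n_ctx: int, n_attn_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: float) -> float:
    """Approximate KV cache size in GiB for a given context length."""
    # Factor 2: one key vector and one value vector per token, layer, and KV head.
    values = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx
    return values * bytes_per_value / 2**30

# Hypothetical architecture numbers, for illustration only.
arch = dict(n_attn_layers=10, n_kv_heads=2, head_dim=256)

for n_ctx in (100_000, 300_000, 400_000):
    q8 = kv_cache_gib(n_ctx, bytes_per_value=Q8_0_BYTES_PER_VALUE, **arch)
    f16 = kv_cache_gib(n_ctx, bytes_per_value=F16_BYTES_PER_VALUE, **arch)
    print(f"{n_ctx:>7} tokens: Q8_0 ~ {q8:.2f} GiB, F16 ~ {f16:.2f} GiB")
```

Because the cache grows linearly with context while the model weights stay fixed, going from 300K to 400K tokens only needs room for the extra 100K tokens of cache, and Q8_0 takes roughly 53% of the space F16 would.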
- Roughly 1.5x the tokens per second with Multi-token Prediction (MTP) compared with standard models.
- 300K token context window achieved on a single 32GB RDNA 4 GPU (Asus Radeon R9700 AI Pro) using Q8_0 KV cache.
- The model is a 35B MoE architecture (Qwen3.6-35B-A3B); the user later switched to the 27B non-MoE version after instability deep in context (around 200K tokens).
Why It Matters
MTP models could dramatically enhance local LLM productivity by enabling faster, longer-context reasoning on consumer hardware.