Qwen 3.6 MTP model boosts local inference speed by 1.5x with 300K context

r/LocalLLaMA May 15, 2026

⚡ A Reddit user pushes a 35B MTP model to 300K context on a single 32GB Radeon GPU.

Deep Dive

In a recent Reddit post, user Jorlen shared promising results from testing the new Qwen3.6-35B-A3B-UD-Q5_K_S model with Multi-token Prediction (MTP), a technique that predicts multiple tokens ahead to speed up inference. Running on an Asus Radeon R9700 AI Pro (32GB, RDNA 4) under Ubuntu 24.04 with Vulkan, they used a Docker container (havenoammo/llama:vulkan-server) to access the MTP prototype of llama.cpp. The model, a 35B mixture-of-experts (MoE), delivered roughly 1.5x the tokens per second of standard (non-MTP) models, making it a strong candidate for local LLM workflows. The Q5_K_S quantization kept memory usage manageable while maintaining quality.
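To give a feel for how predicting tokens ahead can turn into a throughput gain of this order, here is a minimal sketch that models MTP as a draft-and-verify loop. It is not the llama.cpp implementation and does not explain Jorlen's measurement; the acceptance probability `p`, the number of drafted tokens `k`, and the relative drafting cost `draft_cost` are hypothetical inputs, and the model assumes each drafted token is accepted independently.

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per full forward pass.

    Toy assumption: each of the k drafted tokens is accepted independently
    with probability p, and acceptance stops at the first rejection. The
    full pass itself always contributes one token.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # 1 + p + p^2 + ... + p^k
    return sum(p ** i for i in range(k + 1))


def estimated_speedup(p: float, k: int, draft_cost: float = 0.0) -> float:
    """Tokens per unit of compute, relative to plain one-token decoding.

    draft_cost is the (hypothetical) cost of drafting one token, expressed
    as a fraction of one full forward pass.
    """
    return expected_tokens_per_step(p, k) / (1.0 + k * draft_cost)
```

Under these assumptions, a single drafted token that is accepted half the time with negligible drafting overhead (`estimated_speedup(0.5, 1)`) yields 1.5 tokens per pass. Real acceptance rates and overheads depend on the model, the prompt, and the backend.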

Jorlen pushed the context window to 300K tokens with the KV cache quantized to Q8_0, consuming 28.3GB of the 32GB of VRAM and reportedly leaving headroom for up to 400K. The test involved building a pygame dungeon game step by step to simulate realistic project work. At deep context (around 200K tokens), the MoE model became unstable, prompting a switch to the Qwen 3.6 27B non-MoE version for further tests. Even so, the results underscore MTP's potential: faster generation alongside larger context windows on consumer-grade hardware, a significant leap from just a year ago. The local LLM community continues to push boundaries, enabling powerful models to run affordably at home.
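The effect of KV cache quantization on memory can be estimated with some simple arithmetic. The sketch below assumes a plain attention layout in which every layer stores one K and one V vector per token. The layer count, KV head count, and head dimension in the usage note are hypothetical placeholders, not the actual Qwen3.6-35B-A3B configuration, so the result will not reproduce the 28.3GB figure (which also includes the model weights). The Q8_0 size follows from its block layout: 32 int8 values plus one fp16 scale, or 34 bytes per 32 elements.

```python
# Bytes per stored element for two common KV cache types.
F16_BYTES_PER_ELEMENT = 2.0
Q8_0_BYTES_PER_ELEMENT = 34 / 32  # 32 int8 values + one fp16 scale per block


def kv_cache_bytes(
    n_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: float,
) -> float:
    """Approximate KV cache size in bytes.

    Assumes every layer caches one K and one V vector of size
    n_kv_heads * head_dim for each token in the context.
    """
    elements = 2 * n_ctx * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return elements * bytes_per_element


def gib(n_bytes: float) -> float:
    """Convert bytes to GiB."""
    return n_bytes / (1024 ** 3)
```

For example, `gib(kv_cache_bytes(300_000, n_layers=12, n_kv_heads=2, head_dim=128, bytes_per_element=Q8_0_BYTES_PER_ELEMENT))` gives the estimate for one hypothetical configuration. Whatever the architecture, a Q8_0 cache takes about 53% of the space of an f16 cache at the same context length, since the ratio 1.0625 / 2 does not depend on the other parameters.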

Key Points

1.5x tokens per second speed improvement using Multi-token Prediction (MTP) vs standard models.
300K token context window achieved on a single 32GB RDNA 4 GPU (Asus Radeon R9700 AI Pro) using Q8_0 KV cache.
Model is a 35B MoE architecture (Qwen3.6-35B-A3B) – user later switched to 27B non-MoE due to stability issues deep in context.

Why It Matters

MTP models could dramatically enhance local LLM productivity by enabling faster, longer-context reasoning on consumer hardware.
