MTP reduced prompt processing speed by over 50% on 6GB VRAM, negating minor TG gains (<5%)?

MTP reduced prompt processing speed by over 50% on 6GB VRAM, negating minor TG gains (<5%).

q4_0 quantization for draft KV cache saves VRAM with no quality difference vs q8_0?

q4_0 quantization for draft KV cache saves VRAM with no quality difference vs q8_0.

Tested on Qwen3.6-35B-A3B model with llama.cpp, 65K context, Ubuntu/Linux Mint?

Tested on Qwen3.6-35B-A3B model with llama.cpp, 65K context, Ubuntu/Linux Mint.

Open Source

MTP on Qwen3.6-35B-A3B with 6GB VRAM: Slow Prompt Processing Kills Gains

Q: q4_0 quantization for draft KV cache saves VRAM with no quality difference vs q8_0?

q4_0 quantization for draft KV cache saves VRAM with no quality difference vs q8_0.

Q: Tested on Qwen3.6-35B-A3B model with llama.cpp, 65K context, Ubuntu/Linux Mint?

Tested on Qwen3.6-35B-A3B model with llama.cpp, 65K context, Ubuntu/Linux Mint.

r/LocalLLaMA May 17, 2026

⚡New test reveals MTP hurts speed more than it helps on constrained laptops.

Deep Dive

A 2021 Asus ROG Zephyrus G14 laptop with an RTX 3060 (6GB VRAM) and 24GB RAM was used to benchmark the recently merged MTP (Multi-Token Prediction) support in llama.cpp. The test model was Unsloth's Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL (a Qwen variant). Fixed parameters included q8_0 main KV cache, 65,536 context, and spec-draft-n-max=2. The user varied ubatch size and draft KV quantization (q8_0 vs q4_0) across two tasks: a benchmark script and a 13k-token summarization. The results showed that MTP degraded prompt processing (PP) speed dramatically—PP took over 50% longer—while token generation (TG) speed improved by less than 5%. The acceptance rate hovered around 60-70%, but the overall latency was worse, making MTP counterproductive for this VRAM-constrained setup.

The user discovered a clear VRAM-saving win: using q4_0 quantization for the draft model's KV cache performed identically to q8_0 in terms of acceptance rate and TG speed, while freeing up precious megabytes of VRAM. This trick allowed the model to run with a larger fit-target or higher ubatch size, mitigating memory pressure. The hardware setup dedicated the RTX 3060 solely to llama.cpp, with the system using the integrated Radeon GPU. The conclusion is that MTP in its current form is not worth it on laptops with 6GB VRAM, but the q4_0 draft KV quantization tip is valuable for any MTP deployment on memory-limited hardware.

Key Points

MTP reduced prompt processing speed by over 50% on 6GB VRAM, negating minor TG gains (<5%).
q4_0 quantization for draft KV cache saves VRAM with no quality difference vs q8_0.
Tested on Qwen3.6-35B-A3B model with llama.cpp, 65K context, Ubuntu/Linux Mint.

Why It Matters

Provides practical guidance on avoiding MTP overhead on low-VRAM machines while offering a concrete VRAM optimization.

Read Original Article

MTP on Qwen3.6-35B-A3B with 6GB VRAM: Slow Prompt Processing Kills Gains

Why It Matters

Related Articles

🚀 Stay Ahead in AI