larger bpw quants that fit memory often beat smaller quants on generation speed and quality.

20–40% GPU speed boost but increases memory, limiting usability on 16GB GPUs.

CPU MTP not recommended; stick to NTP for best performance on CPUs?

CPU MTP not recommended; stick to NTP for best performance on CPUs.

Open Source

ByteShape's Qwen 3.6 35B GGUF: Larger NTP quants often beat smaller ones

r/LocalLLaMA May 20, 2026

⚡Surprising GPU speed boost: bigger quants outperform smaller bpw variants on quality and tokens per second.

Deep Dive

ByteShape has published Qwen 3.6 35B GGUF quantizations in two families: standard NTP (next-token prediction) and MTP (multi-token prediction). Their benchmarks across RTX 4090, 5090, Pro 6000, 4080, 5060 Ti, plus Intel i7, Ultra 7, Ryzen 9, and Raspberry Pi 5 reveal a counterintuitive finding: for NTP, "pick the largest quant that fits" works surprisingly well. Lower bits-per-weight (bpw) quants did not automatically win on speed or quality. In many cases, the largest release variant stayed competitive in prompt processing and token generation, meaning users should not blindly minimize bpw—if the larger model fits memory and context budget, it is often the better choice.

For MTP, the trade-off is starker. On GPUs, ByteShape saw a meaningful 20–40% generation-speed boost, though this is heavily workload dependent. However, MTP increases runtime memory, so on 16GB GPUs the larger MTP model becomes impractical at typical context sizes, making lower-bit MTP quants the usable recommendation. CPU MTP is not attractive: prompt processing is already slow on CPUs, and MTP makes it worse. ByteShape's CPU recommendation remains NTP. Notably, MMLU was excluded due to answer-format compliance issues in the full-precision model, ensuring quantization comparisons remain clean.

Key Points

NTP: larger bpw quants that fit memory often beat smaller quants on generation speed and quality.
MTP: 20–40% GPU speed boost but increases memory, limiting usability on 16GB GPUs.
CPU MTP not recommended; stick to NTP for best performance on CPUs.

Why It Matters

Practical guidance for LLM deployers on choosing quantization levels, balancing speed, quality, and memory across GPUs and CPUs.

Read Original Article

ByteShape's Qwen 3.6 35B GGUF: Larger NTP quants often beat smaller ones

Why It Matters

Related Articles

🚀 Stay Ahead in AI