ByteShape's Qwen 3.6 35B GGUF: Larger NTP quants often beat smaller ones
Lower bpw is not automatically faster or better: the largest NTP quant that fits often holds its own on quality and tokens per second, while MTP adds a workload-dependent speed boost on GPUs.
ByteShape has published Qwen 3.6 35B GGUF quantizations in two families: standard NTP (next-token prediction) and MTP (multi-token prediction). Their benchmarks across RTX 4090, 5090, Pro 6000, 4080, 5060 Ti, plus Intel i7, Ultra 7, Ryzen 9, and Raspberry Pi 5 reveal a counterintuitive finding: for NTP, "pick the largest quant that fits" works surprisingly well. Lower bits-per-weight (bpw) quants did not automatically win on speed or quality. In many cases, the largest release variant stayed competitive in both prompt processing and token generation. Users should therefore not blindly minimize bpw: if the larger model fits the memory and context budget, it is often the better choice.
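The "pick the largest quant that fits" rule can be sketched in a few lines of Python. This is a minimal illustration, not ByteShape's tooling: the `Quant` type, the `pick_largest_fitting` helper, and the `context_overhead_gib` parameter (a stand-in for KV cache and runtime buffers) are all assumptions made for the example, and any sizes passed in are placeholders rather than real file sizes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Quant:
    """A quantized model variant (hypothetical record, not a real file listing)."""
    name: str        # illustrative label only
    size_gib: float  # size of the weights in GiB (placeholder value)


def pick_largest_fitting(
    quants: Sequence[Quant],
    memory_gib: float,
    context_overhead_gib: float,
) -> Optional[Quant]:
    """Return the largest quant whose weights fit after reserving context memory.

    context_overhead_gib is an assumed allowance for KV cache and runtime
    buffers; measure it for your own context size rather than guessing.
    """
    budget = memory_gib - context_overhead_gib
    fitting = [q for q in quants if q.size_gib <= budget]
    # max(..., default=None) yields None when nothing fits the budget.
    return max(fitting, key=lambda q: q.size_gib, default=None)
```

The helper returns `None` when nothing fits, which is the signal to shrink the context or move to different hardware rather than to force a model that will not load.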
For MTP, the trade-off is starker. On GPUs, ByteShape saw a meaningful 20–40% generation-speed boost, though the gain is heavily workload dependent. MTP also increases runtime memory use, so on 16GB GPUs the larger MTP model becomes impractical at typical context sizes, and the lower-bit MTP quants are the usable recommendation there. CPU MTP is not attractive: prompt processing is already slow on CPUs, and MTP makes it worse, so ByteShape's CPU recommendation remains NTP. MMLU was excluded from the evaluation because the full-precision model itself had answer-format compliance issues, which keeps the quantization comparisons clean.
- NTP: larger bpw quants that fit in memory are often competitive with, or better than, smaller quants on speed and quality.
- MTP: a workload-dependent 20–40% generation-speed boost on GPUs, at the cost of higher runtime memory that limits the larger quants on 16GB GPUs.
- CPU: MTP is not recommended; stick to NTP.
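The takeaways above amount to a small decision rule, sketched here in Python. It is a simplification under stated assumptions: the `recommend` function and its `mtp_extra_gib` parameter are hypothetical, the source does not quantify how much extra memory MTP needs, and the rule ignores the workload dependence of the MTP speedup.

```python
from typing import Optional, Sequence, Tuple


def recommend(
    device: str,
    memory_gib: float,
    quant_sizes_gib: Sequence[float],
    context_gib: float,
    mtp_extra_gib: float,
) -> Tuple[str, Optional[float]]:
    """Return (family, size of the largest quant that fits, or None).

    Assumptions for illustration only:
    - CPUs always use NTP, following the CPU recommendation above.
    - GPUs use MTP, which reserves mtp_extra_gib of additional runtime
      memory (a hypothetical figure you would have to measure).
    """
    if device == "cpu":
        family, extra = "NTP", 0.0
    elif device == "gpu":
        family, extra = "MTP", mtp_extra_gib
    else:
        raise ValueError(f"unknown device kind: {device!r}")

    budget = memory_gib - context_gib - extra
    fitting = [size for size in quant_sizes_gib if size <= budget]
    return family, max(fitting, default=None)
```

With the same memory and the same candidate sizes, the MTP branch has a smaller budget, so it can land on a lower-bit quant than the NTP branch would. That mirrors the 16GB GPU case described above.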
Why It Matters
Practical guidance for LLM deployers on choosing quantization levels, balancing speed, quality, and memory across GPUs and CPUs.