Open Source

Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted: here are the results

Q4 only 6% faster, but multi-GPU setups see bigger jumps

Deep Dive

Following up on his previous Reddit post, havenoammo released Qwen3.6-35B-A3B GGUFs with isolated MTP layers on Hugging Face. His own tests on an RTX 5090 showed modest gains: only a 6% speed increase at Q4 and 2.5% at Q8, far below the 2-2.5x boost seen on the 27B dense model. Another user, however, reported a jump from 110 to 165 t/s with Q8 on 2x 5070 Ti + 3090, so results may vary by hardware. The author calls the findings preliminary and says they may improve; for anyone curious, it's worth a try.
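A quick way to check the gain on your own hardware is to benchmark the vanilla and MTP-grafted GGUFs side by side with llama.cpp's llama-bench. The filenames below are hypothetical placeholders, and whether the bench path actually exercises the grafted MTP layers depends on your llama.cpp build, so treat this as a sketch rather than the author's exact setup.

  # Baseline: vanilla quant without MTP layers (filename is a placeholder)
  llama-bench -m Qwen3.6-35B-A3B-Q8_0.gguf -p 512 -n 128 -ngl 99

  # MTP-grafted quant: assumes the build picks up the grafted layers automatically
  llama-bench -m Qwen3.6-35B-A3B-MTP-Q8_0.gguf -p 512 -n 128 -ngl 99

Compare the t/s column on the tg128 rows between the two runs; that generation throughput corresponds to the t/s figures quoted above. If your build's bench tool doesn't trigger MTP, comparing the end-of-run generation timings from llama-cli works as a fallback.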

Key Points
  • MTP grafting on Qwen3.6-35B-A3B yields only a 6% speedup at Q4 and 2.5% at Q8 on a single RTX 5090
  • A multi-GPU setup (2x 5070 Ti + 3090) saw 50% faster inference (110 → 165 t/s)
  • This contrasts with the 2-2.5x gains on the 27B dense model, suggesting limits in the MoE architecture or in llama.cpp's MTP implementation

Why It Matters

MTP speculative decoding performance is architecture-dependent; MoE models need further optimization to benefit fully.