Got MTP + TurboQuant running: Qwen3.6-27B at 80+ t/s with 262K context on a single RTX 4090
Tuned performance reaches 87 t/s using TurboQuant KV-cache quantization and multi-token prediction (MTP).
Deep Dive
According to the article, Indrasmirror optimized Qwen3.6-27B on a 24 GB RTX 4090, reaching 80–87 t/s after tuning, up from a 43 t/s baseline. The setup uses TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache), a 262K context, and MTP with a 73% draft-acceptance rate at draft length 3; the author reports solid output quality. The fork is buildable from the linked GitHub repository, with kernel-architecture details in the accompanying blog post.
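To see why a 4.25 bpv KV cache matters for fitting a 262K context on a 24 GB card, here is a back-of-the-envelope sizing sketch. The article does not give Qwen3.6-27B's architecture, so the layer count, KV-head count, and head dimension below are assumed placeholder values; treat the resulting numbers as illustrative only:

```python
# KV-cache memory estimate. The architecture numbers are ASSUMPTIONS,
# not published Qwen3.6-27B specs -- swap in the real config if known.
def kv_cache_gib(context_len: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bits_per_value: float) -> float:
    """Return KV-cache size in GiB (K and V tensors across all layers)."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # K + V
    return elements * bits_per_value / 8 / 2**30

CTX = 262_144  # 262K context from the post
ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)  # hypothetical GQA config

fp16 = kv_cache_gib(CTX, bits_per_value=16.0, **ARCH)
tbq = kv_cache_gib(CTX, bits_per_value=4.25, **ARCH)  # TBQ4_0 at 4.25 bpv

print(f"fp16 KV cache:     {fp16:.2f} GiB")  # larger than the whole card
print(f"4.25 bpv KV cache: {tbq:.2f} GiB")
```

Whatever the real layer/head counts, the ratio is fixed: 4.25 bpv is a ~3.8x reduction over fp16, which is what moves a quarter-million-token cache from impossible to plausible alongside the quantized weights.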
Key Points
- Achieved 80–87 t/s on the RTX 4090, roughly double the 43 t/s baseline.
- Utilized TurboQuant's lossless KV cache and MTP, with a 262K context.
- The MTP draft-acceptance rate is 73%, so most speculated tokens are kept; the author reports no loss in output quality.
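The 73% figure governs how much speedup MTP can deliver. Assuming acceptance is independent per draft token (a simplification; the post does not say how the 73% is measured), the expected number of tokens emitted per target-model pass at draft length 3 can be sketched as:

```python
# Expected tokens per verification step in speculative/MTP decoding,
# ASSUMING i.i.d. per-token acceptance (a simplifying model).
def expected_tokens_per_step(p_accept: float, draft_len: int) -> float:
    # 1 guaranteed token from the target model, plus a geometric run of
    # accepted draft tokens: 1 + p + p^2 + ... + p^draft_len
    return sum(p_accept ** i for i in range(draft_len + 1))

print(f"{expected_tokens_per_step(0.73, 3):.2f} tokens per target pass")
```

Under this toy model, ~2.65 tokens per target forward pass is an upper bound on the speedup; after draft-model overhead, a roughly 2x gain like the reported 43 → 87 t/s is in the expected range.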
Why It Matters
Sustaining 80+ t/s at a 262K context on a single consumer GPU shows that long-context inference is becoming practical on local hardware, lowering the cost barrier for developers and researchers alike.