llama.cpp b9455 with 2x3090 hits 70+ tk/s on Qwen3.6-27B MTP
New llama.cpp build beats vLLM with tensor-split and MTP speculative decoding.
The latest llama.cpp build, b9455, has turned heads in the local AI community: a Reddit user running dual RTX 3090s reports 70+ tokens per second on the Qwen3.6-27B-MTP model using Unsloth's UD-Q8_K_XL quant. For months, vLLM had been the faster option on this hardware, reaching 70+ tk/s with the model split across both GPUs, but its weaker support for quantized models led to minor mistakes in generated code. The llama.cpp gain came from enabling tensor-split (a 50/50 split across the two GPUs) together with MTP (multi-token prediction) speculative decoding, a combination that delivers both speed and accuracy. The user reports that code outputs are now 'clean' compared to vLLM, which makes llama.cpp their preferred local inference engine for 27B-parameter models.
The setup uses the flags --tensor-split 50,50, --flash-attn on, --cache-type-k q8_0, and --spec-type draft-mtp. Reported decode speeds stay between 66 and 78 tk/s across context lengths of up to 68K tokens. A cold prefill of 68K tokens takes ~54 seconds (1247 tk/s), while runs that reuse a cached prompt decode at ~70 tk/s. For developers, this makes fast local inference practical on consumer GPUs, especially for code generation tasks where output quality matters.
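As a rough illustration of how the flags quoted above fit together, here is a minimal Python sketch that assembles them into a single command line. The `llama-server` binary name, the `-m` model flag, and the model path are assumptions, since the report lists only the four tuning flags; the sketch prints the command rather than running it.

```python
import shlex

# Hypothetical path: the report does not say where the GGUF file lives.
MODEL_PATH = "/path/to/model.gguf"


def build_llama_server_cmd(model_path: str) -> list[str]:
    """Assemble an argv list from the flags quoted in the report."""
    return [
        "llama-server",              # assumed executable; not named in the report
        "-m", model_path,            # model file (placeholder path)
        "--tensor-split", "50,50",   # equal share of the model on each GPU
        "--flash-attn", "on",        # enable flash attention
        "--cache-type-k", "q8_0",    # keep the K cache in q8_0
        "--spec-type", "draft-mtp",  # MTP speculative decoding, as reported
    ]


cmd = build_llama_server_cmd(MODEL_PATH)
print(shlex.join(cmd))  # shell-quoted string, ready to paste into a terminal
```

Keeping the arguments as a list makes it easy to pass them to `subprocess.run(cmd)` later without worrying about shell quoting.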
- llama.cpp b9455 achieves 70+ tk/s on dual RTX 3090s with the Qwen3.6-27B-MTP UD-Q8_K_XL quant.
- MTP speculative decoding and tensor-split (50/50 GPU allocation) deliver both speed and accurate code output.
- llama.cpp reportedly outperforms vLLM in code quality thanks to better quant support, with consistent decode speeds of 66–78 tk/s.
Why It Matters
Local inference on consumer hardware now rivals cloud performance for 27B models, reducing latency and cost for developers.
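To put the latency claim in concrete terms, the sketch below turns the reported throughput figures into wall-clock estimates. It reads "68K" as 68,000 tokens and assumes a 1,000-token response; both are illustrative assumptions, not measurements from the report.

```python
PREFILL_TPS = 1247.0             # reported cold prefill throughput (tk/s)
DECODE_TPS = (66.0, 70.0, 78.0)  # reported decode range and typical value (tk/s)
PROMPT_TOKENS = 68_000           # assumption: "68K" read as 68,000 tokens
OUTPUT_TOKENS = 1_000            # hypothetical response length


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a steady throughput."""
    return tokens / tokens_per_second


cold_prefill = seconds_for(PROMPT_TOKENS, PREFILL_TPS)
print(f"cold prefill: {cold_prefill:.1f} s")

for tps in DECODE_TPS:
    decode = seconds_for(OUTPUT_TOKENS, tps)
    # A cold request pays for prefill and decode; a cached one mostly pays for decode.
    print(f"{tps:.0f} tk/s: decode {decode:.1f} s, cold total {cold_prefill + decode:.1f} s")
```

Under these assumptions the cold prefill dominates a long-context request, which is why the cached-prompt figure matters for interactive coding sessions.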