Open Source

llama.cpp b9455 with 2x3090 hits 70+ tk/s on Qwen3.6-27B MTP

r/LocalLLaMA June 03, 2026

⚡ New llama.cpp build beats vLLM with tensor-split and MTP speculative decoding.

Deep Dive

The latest llama.cpp build, b9455, has turned heads in the local AI community: a Reddit user running dual RTX 3090s reports 70+ tokens per second (tk/s) on the Qwen3.6-27B-MTP model using Unsloth's UDQ8KXL quant. According to the post, vLLM had led for months with tensor-split speeds of 70+ tk/s, but its poor quant compatibility led to minor coding mistakes. The llama.cpp gain came from enabling tensor-split (a 50/50 split of the model across the two GPUs) together with MTP (multi-token prediction) speculative decoding, in which cheaply drafted tokens are verified by the full model before they are kept. The user describes the resulting code output as 'clean' compared to vLLM's and now prefers llama.cpp as a local inference engine for this 27B-parameter model.
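To make the draft-and-verify idea behind speculative decoding concrete, here is a minimal toy sketch of the greedy variant. It is not llama.cpp's implementation: the post does not describe how draft-mtp works internally, and the names `speculative_step`, `toy_target`, and `toy_draft` are hypothetical. In MTP-style setups the drafts generally come from extra prediction heads of the same model rather than from a separate draft model; the sketch only models "something cheap proposes, the full model checks".

```python
from typing import Callable, List

# A "model" here is just a function from a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-and-verify round; returns the tokens that are kept."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposal. A real engine scores all k
    #    positions in one batched pass; this toy calls it one at a time.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop the round.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target adds one more token for free.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             context: List[int], n: int, k: int) -> List[int]:
    """Generate n tokens using repeated speculative rounds."""
    out: List[int] = []
    while len(out) < n:
        out.extend(speculative_step(target, draft, context + out, k))
    return out[:n]


def greedy(target: NextToken, context: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target only."""
    out: List[int] = []
    for _ in range(n):
        out.append(target(context + out))
    return out


# Toy models (hypothetical): the target counts upward modulo 10; the
# drafter agrees except that it guesses wrong right after a 4.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

In this greedy form the output is identical to what the target would produce on its own, so the drafter affects only speed, not the generated text. The speedup depends on how often the drafts are accepted.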

The setup uses flags such as --tensor-split 50,50, --flash-attn on, --cache-type-k q8_0, and --spec-type draft-mtp. Reported decode speeds stay between 66 and 78 tk/s across context lengths of up to 68K tokens. A cold prefill of a 68K-token context takes about 54 seconds (1247 tk/s), while cached runs decode at about 70 tk/s. For developers, this puts fast, dependable local inference within reach on consumer hardware, especially for code generation tasks where output quality matters.
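The flags quoted above can be collected into a single launch command. The sketch below only assembles the argument list. The binary name `llama-server`, the `--model` argument, and the model path are assumptions not stated in the post, and `--spec-type draft-mtp` is reproduced as quoted without being checked against a particular build. The post also lists only some of its flags ("flags like"), so this is likely not the complete invocation.

```python
import shlex
from typing import List


def build_llama_cmd(model_path: str, binary: str = "llama-server") -> List[str]:
    """Assemble a llama.cpp launch command from the flags quoted in the post."""
    return [
        binary,                        # assumption: the post does not name the binary
        "--model", model_path,         # assumption: hypothetical path supplied by caller
        "--tensor-split", "50,50",     # split the model evenly across the two GPUs
        "--flash-attn", "on",          # as quoted in the post
        "--cache-type-k", "q8_0",      # quantize the K cache to q8_0
        "--spec-type", "draft-mtp",    # as quoted in the post; not verified here
    ]


if __name__ == "__main__":
    # Hypothetical file location; substitute the actual GGUF path.
    cmd = build_llama_cmd("/path/to/Qwen3.6-27B-MTP-UDQ8KXL.gguf")
    # Print a copy-pasteable shell line instead of launching anything.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each flag paired with its value and avoids shell-quoting problems if the model path contains spaces.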

Key Points

llama.cpp b9455 reaches 70+ tk/s on dual RTX 3090s with the Qwen3.6-27B-MTP UDQ8KXL quant.
Tensor-split (50/50 across the two GPUs) combined with MTP speculative decoding delivers both speed and accurate code output.
The poster reports cleaner code than under vLLM, attributed to better quant support, with decode speeds holding at 66–78 tk/s.

Why It Matters

Local inference on consumer hardware now rivals cloud performance for 27B models, reducing latency and cost for developers.
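To put the reported throughput in latency terms, the following back-of-the-envelope sketch estimates the wall-clock time of one request as prefill time plus decode time. The function name `end_to_end_seconds` and the 1,000-token output length are illustrative assumptions; the rates are the ones reported in the post (1247 tk/s cold prefill, 66–78 tk/s decode). It ignores model load time and any prompt caching.

```python
def end_to_end_seconds(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, decode_tps: float) -> float:
    """Estimate request latency as prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


if __name__ == "__main__":
    PREFILL_TPS = 1247.0                     # reported cold prefill rate
    DECODE_SLOW, DECODE_FAST = 66.0, 78.0    # reported decode range

    # 54 s at 1247 tk/s corresponds to 54 * 1247 = 67,338 tokens,
    # which matches the "68K tokens" quoted in the post after rounding.
    prompt_tokens = 54 * 1247

    # Hypothetical request: full cold context plus 1,000 generated tokens.
    worst = end_to_end_seconds(prompt_tokens, 1000, PREFILL_TPS, DECODE_SLOW)
    best = end_to_end_seconds(prompt_tokens, 1000, PREFILL_TPS, DECODE_FAST)
    print(f"cold request: {best:.1f} s to {worst:.1f} s")
```

For a long cold context the prefill dominates the total, so the decode rate mainly matters once the prompt is cached or short.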
