Open Source

Llama.cpp fork delivers 40% faster inference on dual GPUs with tensor splitting fix

r/LocalLLaMA May 17, 2026

⚡ A Reddit user's fork fixes a longstanding issue, boosting tokens per second by roughly 40% on dual-GPU setups.

Deep Dive

A Reddit user, Legitimate-Dog5690, has released a fork of llama.cpp that fixes a longstanding limitation of tensor parallelism. The original '--split-mode tensor' option performed well but only supported non-quantized KV caches, so many users either kept a larger, non-quantized KV cache or gave up tensor splitting across their two GPUs. The fork, branched from mainline on the day of the post with minimal changes, enables quantized KV caches (q8_0) with tensor splitting, unlocking a significant speedup.
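To give a rough sense of why a quantized KV cache matters on 12GB cards, the sketch below estimates cache size for f16 versus q8_0. The model dimensions (`n_layers`, `n_kv_heads`, `head_dim`, `n_ctx`) are hypothetical placeholders, not the real Qwen3.5 27B configuration, and the formula assumes a plain attention layout where every layer stores full keys and values. The only fixed fact used is the q8_0 block layout: 32 int8 values plus one fp16 scale, or 34 bytes per 32 elements.

```python
# Rough KV cache size estimate. All model dimensions below are HYPOTHETICAL
# placeholders for illustration, not the actual Qwen3.5 27B configuration.

F16_BYTES_PER_ELEM = 2.0
Q8_0_BYTES_PER_ELEM = 34 / 32  # 32 int8 values + one fp16 scale per block


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> float:
    """Bytes needed to store keys and values for n_ctx tokens."""
    # Factor 2 covers the separate K and V tensors in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical dense-model shape and context length.
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)

f16_bytes = kv_cache_bytes(**dims, bytes_per_elem=F16_BYTES_PER_ELEM)
q8_bytes = kv_cache_bytes(**dims, bytes_per_elem=Q8_0_BYTES_PER_ELEM)

GIB = 1024 ** 3
print(f"f16 cache:  {f16_bytes / GIB:.2f} GiB")
print(f"q8_0 cache: {q8_bytes / GIB:.2f} GiB")
print(f"q8_0 / f16: {q8_bytes / f16_bytes:.3f}")
```

Whatever the real dimensions are, the ratio is the same: under these assumptions q8_0 needs a little over half the memory of f16, and that is the saving users previously had to give up in order to use tensor splitting.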

Benchmarks on the Qwen3.5 27B Q4_K_M model with an RTX 3060 (12GB) + RTX 4070 Super (12GB) show a roughly 40% improvement in generation speed: 30.05 t/s with tensor split versus 21.22 t/s without (about 1.42×). For prompt processing, the fork achieves 544.82 t/s, close to the 582.60 t/s of single-GPU mode. In real-world usage, the user reports going from ~25 t/s to ~40 t/s in story-generation contexts, aided by MTP speculative decoding (e.g., `--spec-type draft-mtp`). The fork also supports the latest multi-token prediction (MTP) changes.
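The headline percentages can be checked directly from the reported throughput figures. This is a minimal sketch; the function name `speedup_percent` is only illustrative, and the inputs are the numbers quoted above.

```python
def speedup_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


# Token generation: tensor split vs. no tensor split (reported figures).
gen_gain = speedup_percent(21.22, 30.05)

# Real-world story generation with MTP speculative decoding (approximate figures).
real_world_gain = speedup_percent(25.0, 40.0)

# Prompt processing: the fork's throughput as a fraction of single-GPU mode.
pp_ratio = 544.82 / 582.60

print(f"generation gain:       {gen_gain:.1f}%")
print(f"real-world gain:       {real_world_gain:.1f}%")
print(f"prompt processing:     {pp_ratio:.1%} of single-GPU speed")
```

On these figures the generation gain is a little above the rounded 40% in the headline, while prompt processing on the fork runs slightly slower than the single-GPU result.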

Key limitations: MoE (Mixture of Experts) models have an unrelated issue with tensor splitting, so dense models like Qwen3.5 27B or 9B are recommended. The user invites feedback from those running dual 5060 Ti or Vulkan setups. If the fork proves popular, they plan to fix MoE support and integrate Turboquants. Overall, this is a free, immediate performance boost for anyone with dual GPUs who runs local LLMs via llama.cpp.

Key Points

Fixes tensor parallelism ('--split-mode tensor') for quantized KV caches; the option previously worked only with non-quantized caches.
Achieves 30.05 t/s vs 21.22 t/s (roughly 40% faster) on Qwen3.5 27B Q4_K_M with an RTX 3060 paired with an RTX 4070 Super.
Supports MTP speculative decoding, pushing real-world speeds from ~25 t/s to ~40 t/s in narrative contexts.

Why It Matters

Dual-GPU owners get a free, roughly 40% inference speedup without quality loss, making local LLM usage more practical.
