Open Source

llama.cpp fork boosts MoE inference by caching experts instead of layers

r/LocalLLaMA May 22, 2026

⚡Experimental fork stores active experts in VRAM, boosting speeds on 12GB GPUs.

Deep Dive

A developer frustrated by VRAM constraints on an RTX 2060 12GB created an experimental fork of llama.cpp that rethinks how Mixture-of-Experts (MoE) models are loaded. Instead of splitting entire layers between CPU and GPU, which spends VRAM on experts that are rarely used, the fork profiles which experts are actually activated per token and caches only the “hot” ones in VRAM. A new UI shows expert usage in real time.
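The idea of caching experts rather than layers can be illustrated with a toy model. The sketch below is not the fork's code: the class name HotExpertCache, its methods, and the "capacity in experts" unit are all hypothetical, and it only counts router selections and keeps the most frequently used (layer, expert) pairs in a set that stands in for VRAM.

```python
from collections import Counter


class HotExpertCache:
    """Toy model (hypothetical, not the fork's implementation):
    tracks which (layer, expert) pairs are routed to most often and
    treats the top `capacity` of them as resident in VRAM."""

    def __init__(self, capacity):
        self.capacity = capacity  # max experts held "in VRAM" (assumed unit)
        self.counts = Counter()   # activation count per (layer, expert)
        self.resident = set()     # pairs currently treated as cached
        self.hits = 0
        self.lookups = 0

    def route(self, layer, expert_ids):
        """Record the experts the router selected for one token at one layer."""
        for expert in expert_ids:
            key = (layer, expert)
            self.lookups += 1
            if key in self.resident:
                self.hits += 1
            self.counts[key] += 1

    def refresh(self):
        """Rebuild the resident set from the most frequently used experts so far."""
        self.resident = {key for key, _ in self.counts.most_common(self.capacity)}

    def hit_rate(self):
        """Fraction of expert lookups served from the resident set."""
        return self.hits / self.lookups if self.lookups else 0.0


# Usage: warm up on a short, made-up routing trace, then refresh the cache.
cache = HotExpertCache(capacity=2)
for layer, experts in [(0, [1, 2]), (0, [1, 3]), (0, [1, 2])]:
    cache.route(layer, experts)
cache.refresh()
print(sorted(cache.resident))
```

In this simplified picture, the hit rate climbs as the resident set converges on the experts a given workload actually uses, which is the effect the benchmark figures describe.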

In initial benchmarks with Qwen 35B A3B (Q6 quantization, 100k context, no context quantization, to preserve coding accuracy), the standard --n-cpu-moe approach delivered ~19 tk/s. The expert caching variant hit ~22 tk/s immediately, and once the cache-hit rate reached 62% it ran at 26 tk/s, roughly a 37% improvement over the baseline. The break-even point was a hit rate of just 42%. The fork also supports a --moe-hot-cache argument to control VRAM usage. The developer now invites others with RTX 3060/4060 or similar GPUs to test the fork on Linux with CUDA, hoping to see how token generation scales across consumer hardware.
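The reported gains can be checked with simple arithmetic. The snippet below uses only the throughput figures quoted above; the helper name percent_gain is introduced here for illustration.

```python
def percent_gain(baseline_tps, new_tps):
    """Relative throughput improvement over the baseline, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


baseline = 19.0  # tk/s with the standard approach, as reported
initial = 22.0   # tk/s with expert caching, reported as the immediate result
warm = 26.0      # tk/s at a 62% cache-hit rate, as reported

print(f"initial: {percent_gain(baseline, initial):.1f}% faster")   # 15.8% faster
print(f"62% hit rate: {percent_gain(baseline, warm):.1f}% faster")  # 36.8% faster
```

The second figure rounds to the 37% gain quoted in the summary.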

Key Points

Performance increased from 19 to 26 tk/s on RTX 2060 12GB with Qwen 35B A3B (37% gain)
Break-even cache-hit rate is 42%; at 62% hit rate the speed improvement is substantial
Requires Linux, CUDA, and models like Qwen 35B A3B or Gemma 26B A4B to reproduce

Why It Matters

Enables larger contexts and faster coding workflows on consumer GPUs without requiring high-end hardware.
