llama.cpp fork boosts MoE inference by caching experts instead of layers
Experimental fork stores active experts in VRAM, boosting speeds on 12GB GPUs.
A developer frustrated by VRAM constraints on his RTX 2060 12GB created an experimental fork of llama.cpp that rethinks how Mixture-of-Experts (MoE) models are loaded. Instead of splitting entire layers between CPU and GPU, which wastes VRAM on experts that go unused, the fork profiles which experts are actually activated per token and caches only the “hot” ones in VRAM. A new UI shows expert usage in real time.
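To make the idea concrete, here is a minimal Python sketch of frequency-based expert caching. It is not the fork's code: the class `HotExpertCache`, its `capacity` parameter, and the rebalancing policy are all hypothetical, chosen only to illustrate "count activations, keep the most-used experts resident".

```python
from collections import Counter


class HotExpertCache:
    """Toy model of a hot-expert cache (hypothetical, not the fork's implementation)."""

    def __init__(self, capacity):
        self.capacity = capacity   # assumed: number of experts that fit in VRAM
        self.counts = Counter()    # (layer, expert) -> activation count
        self.resident = set()      # experts currently treated as "in VRAM"
        self.hits = 0
        self.lookups = 0

    def access(self, layer, expert_ids):
        """Record the experts the router selected for one token in one layer."""
        for expert in expert_ids:
            key = (layer, expert)
            self.lookups += 1
            if key in self.resident:
                self.hits += 1
            self.counts[key] += 1

    def rebalance(self):
        """Make the `capacity` most frequently activated experts resident."""
        self.resident = {key for key, _ in self.counts.most_common(self.capacity)}

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0


# Small demonstration with made-up routing decisions.
cache = HotExpertCache(capacity=2)
for ids in [[0, 1], [0, 1], [0, 2]]:
    cache.access(layer=0, expert_ids=ids)
cache.rebalance()                         # experts 0 and 1 are now resident
cache.access(layer=0, expert_ids=[0, 3])  # expert 0 hits, expert 3 misses
print(sorted(cache.resident), cache.hit_rate())
```

In this toy model the hit rate starts at zero and climbs as the profile warms up, which is the same shape as the speedup the developer reports once the cache-hit rate rises.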
In initial benchmarks with Qwen 35B A3B (Q6 quantization, 100k context, no context quantization to preserve coding accuracy), the standard n-cpu-moe approach delivered ~19 tk/s. The expert-caching variant hit ~22 tk/s immediately and reached 26 tk/s once the cache-hit rate climbed to 62%, a 37% improvement over the baseline. The break-even point was a hit rate of just 42%. The fork also adds a --moe-hot-cache argument to control VRAM usage. The developer is inviting others with RTX 3060/4060 or similar GPUs to test the fork on Linux with CUDA, to see how token generation scales across consumer hardware.
- Performance increased from 19 to 26 tk/s on RTX 2060 12GB with Qwen 35B A3B (37% gain)
- Break-even cache-hit rate is 42%; the full 37% gain was measured at a 62% hit rate
- Requires Linux, CUDA, and models like Qwen 35B A3B or Gemma 26B A4B to reproduce
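The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the throughput numbers quoted above; the helper name `gain_pct` is illustrative.

```python
def gain_pct(new_tps, base_tps):
    """Percentage throughput gain of new_tps over base_tps."""
    return (new_tps / base_tps - 1.0) * 100.0


baseline_tps = 19.0  # standard n-cpu-moe, as reported
cold_tps = 22.0      # expert cache immediately after start, as reported
warm_tps = 26.0      # expert cache at a 62% hit rate, as reported

print(f"cold cache: {gain_pct(cold_tps, baseline_tps):.0f}% faster")
print(f"warm cache: {gain_pct(warm_tps, baseline_tps):.0f}% faster")

# Per-token latency makes the difference easier to feel in interactive use.
print(f"{1000.0 / baseline_tps:.1f} ms -> {1000.0 / warm_tps:.1f} ms per token")
```

This gives roughly a 16% gain with a cold cache and 37% with a warm one, with per-token latency falling from about 52.6 ms to about 38.5 ms.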
Why It Matters
Expert-level caching enables larger contexts and faster coding workflows on consumer GPUs with limited VRAM, with no need for high-end hardware.