Open Source

Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload

New caching technique predicts which experts to load into VRAM, achieving 22.67 tokens/sec generation.

Deep Dive

A new optimization technique for llama.cpp, dubbed a "dynamic expert cache," significantly improves the performance of massive Mixture-of-Experts (MoE) models on systems with limited GPU memory. Developed by a community member, the code monitors which of a model's many "expert" sub-networks are used most frequently over a rolling window of tokens. It then makes a calculated bet, loading those high-priority experts from system RAM into the faster VRAM of the GPU. The wager is that the resulting gain in processing speed outweighs the latency cost of the transfers; the cache is rebalanced at set intervals, as sketched below.
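A minimal sketch of that bookkeeping, assuming the router hands us the list of experts activated for each generated token. All names here (`HotExpertCache`, `observe`, `rebalance`) and the window/interval plumbing are illustrative assumptions, not the fork's actual code:

```cpp
// Hypothetical sketch of a rolling-window "hot expert" cache.
// The real fork integrates this into llama.cpp's MoE routing and
// tensor offload machinery; this only shows the core idea.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <numeric>
#include <unordered_set>
#include <vector>

struct HotExpertCache {
    int n_experts;          // total experts in the model
    int hot_k;              // how many experts to keep resident in VRAM
    int rebalance_interval; // rebalance every N generated tokens
    int window;             // rolling window length, in tokens

    std::vector<uint32_t> counts;         // usage count per expert
    std::deque<std::vector<int>> history; // experts activated per token
    std::unordered_set<int> hot;          // experts currently in VRAM
    int tokens_since_rebalance = 0;

    HotExpertCache(int n_experts, int hot_k, int interval, int window)
        : n_experts(n_experts), hot_k(hot_k),
          rebalance_interval(interval), window(window),
          counts(n_experts, 0) {}

    // Called once per generated token with the experts the router selected.
    void observe(const std::vector<int>& activated) {
        for (int e : activated) counts[e]++;
        history.push_back(activated);
        // Age out tokens that fell off the rolling window.
        while ((int)history.size() > window) {
            for (int e : history.front()) counts[e]--;
            history.pop_front();
        }
        if (++tokens_since_rebalance >= rebalance_interval) {
            rebalance();
            tokens_since_rebalance = 0;
        }
    }

    // Pick the top-k most-used experts and (conceptually) move the
    // newcomers from host RAM into VRAM, evicting ones that cooled off.
    void rebalance() {
        std::vector<int> order(n_experts);
        std::iota(order.begin(), order.end(), 0);
        std::partial_sort(order.begin(), order.begin() + hot_k, order.end(),
                          [&](int a, int b) { return counts[a] > counts[b]; });
        std::unordered_set<int> next(order.begin(), order.begin() + hot_k);
        for (int e : next)
            if (!hot.count(e)) upload_to_vram(e);   // RAM -> VRAM copy
        for (int e : hot)
            if (!next.count(e)) evict_from_vram(e); // free the VRAM slot
        hot = std::move(next);
    }

    void upload_to_vram(int /*expert*/)  { /* async device copy in practice */ }
    void evict_from_vram(int /*expert*/) { /* release or overwrite buffer   */ }
};
```

The rolling window keeps the counts responsive to topic shifts in the conversation, while the rebalance interval amortizes the cost of RAM-to-VRAM copies over many generated tokens.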

Benchmarked with the Qwen3.5-122B-A10B model on a setup with an RTX 4090 (24GB VRAM) and a Ryzen 9 CPU, the technique delivers compelling results. Compared to a baseline where all experts are offloaded to the CPU (15.65 tokens/sec), the new cache reached 22.67 tokens/sec for token generation, a 44.8% speedup. More importantly, it outperformed a standard layer-based partial offload using equivalent VRAM by 26.8%. This means users can now run 122-billion-parameter models on a single high-end consumer GPU at speeds that make interactive use feasible, without requiring expensive unified-memory systems.

The implementation, currently available in a GitHub fork of llama.cpp, introduces new arguments such as `LLAMA_ARG_MOE_HOT_K` to set the cache size and `LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL` to control how often the system re-evaluates its expert predictions (see the sketch below). While prompt processing speed remains similar to other methods, the breakthrough is in generation speed. This approach is a clever software solution to a hardware limitation, making state-of-the-art large language models more accessible to researchers and developers without datacenter-scale resources.
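Only those two setting names come from the source; llama.cpp conventionally mirrors its CLI flags as `LLAMA_ARG_*` environment variables, so a purely illustrative reading of the configuration, assuming the fork follows that convention and with made-up defaults, might look like this:

```cpp
// Hypothetical sketch of reading the fork's new settings.
// Only the two variable names come from the source; the parsing
// code, defaults, and struct are illustrative assumptions.
#include <cstdio>
#include <cstdlib>

struct moe_hot_params {
    int hot_k              = 0;   // assumed: 0 = feature disabled
    int rebalance_interval = 512; // assumed default, in generated tokens
};

static int env_int(const char* name, int fallback) {
    const char* v = std::getenv(name);
    return v ? std::atoi(v) : fallback;
}

int main() {
    moe_hot_params p;
    p.hot_k              = env_int("LLAMA_ARG_MOE_HOT_K", p.hot_k);
    p.rebalance_interval = env_int("LLAMA_ARG_MOE_HOT_REBALANCE_INTERVAL",
                                   p.rebalance_interval);
    std::printf("hot experts: %d, rebalance every %d tokens\n",
                p.hot_k, p.rebalance_interval);
    return 0;
}
```

The right cache size would depend on how many experts fit in the VRAM left over after the dense weights and KV cache, which is presumably why it is exposed as a tunable rather than fixed.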

Key Points
  • Achieves 22.67 tokens/sec generation for Qwen3.5-122B-A10B, a 44.8% speedup over all-CPU offloading.
  • Outperforms traditional layer-based GPU offload by 26.8% while using a similar amount of VRAM (~22 GB).
  • Uses predictive caching to load frequently accessed "experts" into GPU VRAM, rebalancing dynamically every N tokens.

Why It Matters

Dramatically lowers the hardware barrier for running massive 100B+ parameter MoE models, making advanced AI more accessible.