llama.cpp ubatch tuning boosts MoE prompt processing 5.5x on RTX 3090
Raise ubatch to 8192 and keep more MoE layers on the CPU to reach 2091 tok/s prefill
A Reddit user (coder543) shared a llama.cpp tuning trick that sharply speeds up prompt processing on Mixture of Experts (MoE) models. Raising the physical micro-batch size (-ub) from the default of 512 to 8192, and raising --n-cpu-moe so that more MoE layers stay on the CPU, gave a 5.5x increase in prefill throughput on a 24GB RTX 3090 running gpt-oss-120b-F16.gguf. Prefill speed rose from 380 tok/s to 2091 tok/s, while token generation fell by only about 7% (from 32.3 to 30.1 tok/s). A larger ubatch needs a larger compute buffer in VRAM, so the ubatch 8192 run required --n-cpu-moe 28, up from 26 in the baseline run. The benchmark is informal (it compares a pp4096 run against a pp8192 run at the highest ubatch setting), but the trend is clear.
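As a rough illustration of the two configurations, the sketch below assembles the corresponding command lines. Several details are assumptions and not taken from the post: the choice of the llama-server binary, the -ngl 99 value for offloading all layers to the GPU, and the helper name build_llama_server_cmd, which is hypothetical. The logical batch size (-b) is raised alongside -ub because llama.cpp caps the micro-batch at the logical batch size, whose default is 2048.

```python
import shlex

DEFAULT_LOGICAL_BATCH = 2048  # llama.cpp default for -b


def build_llama_server_cmd(model_path, ubatch, n_cpu_moe, n_gpu_layers=99):
    """Assemble an illustrative llama-server command line (hypothetical helper)."""
    # Keep the logical batch at least as large as the micro-batch,
    # otherwise the requested -ub value would be clamped down to -b.
    batch = max(ubatch, DEFAULT_LOGICAL_BATCH)
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", str(n_gpu_layers),      # assumption: offload every layer to the GPU
        "--n-cpu-moe", str(n_cpu_moe),  # keep MoE expert weights of the first N layers on the CPU
        "-b", str(batch),               # logical batch size
        "-ub", str(ubatch),             # physical micro-batch size
    ]


baseline = build_llama_server_cmd("gpt-oss-120b-F16.gguf", ubatch=512, n_cpu_moe=26)
tuned = build_llama_server_cmd("gpt-oss-120b-F16.gguf", ubatch=8192, n_cpu_moe=28)

print(shlex.join(baseline))
print(shlex.join(tuned))
```

The right --n-cpu-moe value depends on the model, the context size, and how much VRAM is free, so the numbers 26 and 28 should be treated as a starting point for a 24GB card and this model only.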
The technique highlights an important trade-off for local LLM deployment: keeping a few more MoE layers on the CPU frees VRAM for a larger micro-batch, which greatly accelerates prompt processing at a small cost to generation speed. The user noted that this trick nearly closes the prompt-processing gap with their DGX Spark, which still offers slightly better prompt speeds and double the generation rate for the same model. For hobbyists and professionals running prompt-heavy applications (e.g., RAG, chatbots, batch inference) on consumer GPUs, this setting makes large MoE models more practical without a hardware upgrade.
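To see when the trade-off pays off, a simple latency model helps: request time is prompt tokens divided by prefill speed plus generated tokens divided by generation speed. The sketch below plugs in the reported throughput figures. It assumes both rates stay constant regardless of prompt length, which is a simplification (a short prompt cannot fill an 8192-token micro-batch), and the function names are illustrative.

```python
# Reported throughput in tokens per second (pp = prefill, tg = generation).
BASELINE = {"pp_tps": 380.0, "tg_tps": 32.3}   # -ub 512
TUNED = {"pp_tps": 2091.0, "tg_tps": 30.1}     # -ub 8192


def request_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Estimated wall time for one request under constant-rate assumptions."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


def break_even_gen_ratio(base, tuned):
    """Generated-to-prompt token ratio below which the tuned config is faster."""
    prefill_saved_per_token = 1 / base["pp_tps"] - 1 / tuned["pp_tps"]
    decode_lost_per_token = 1 / tuned["tg_tps"] - 1 / base["tg_tps"]
    return prefill_saved_per_token / decode_lost_per_token


for prompt, gen in [(8192, 512), (8192, 8192), (512, 4096)]:
    b = request_seconds(prompt, gen, **BASELINE)
    t = request_seconds(prompt, gen, **TUNED)
    print(f"{prompt:>5} prompt + {gen:>5} generated: baseline {b:6.1f}s, tuned {t:6.1f}s")

print(f"break-even generated/prompt ratio: {break_even_gen_ratio(BASELINE, TUNED):.2f}")
```

Under these assumptions the break-even ratio comes out a little under 1, so the tuned configuration is faster whenever a request generates fewer tokens than it reads, and slower for short prompts followed by long generations.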
- Increasing -ub from 512 to 8192 makes prompt processing 5.5x faster (380→2091 tok/s) on an RTX 3090
- Requires keeping more MoE layers on the CPU (--n-cpu-moe 28) so the larger batch fits in 24GB of VRAM
- Generation speed drops only about 7% (32.3→30.1 tok/s), which suits prompt-heavy workloads
Why It Matters
Makes 120B MoE models viable on consumer GPUs for prompt-heavy tasks like RAG and chatbots