RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with a 128K context: the `--n-cpu-moe` flag is the most important part.
A new llama.cpp flag unlocks 54% faster AI inference on consumer GPUs by optimizing VRAM usage for MoE models.
A detailed technical experiment has revealed a major performance optimization for running large language models locally. By replacing the commonly recommended `--cpu-moe` flag in llama.cpp with the more granular `--n-cpu-moe N` flag, a user achieved a 54% speed boost running the 35-billion-parameter Qwen3.6-A3B model. The key insight is that `--cpu-moe` keeps every MoE expert weight on the CPU, leaving most of the GPU's VRAM idle. In contrast, `--n-cpu-moe 20` keeps only the expert weights of the first 20 of the model's 40 MoE layers on the CPU and offloads the rest to the GPU, fully utilizing the RTX 5070 Ti's 16GB of VRAM.
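As a rough illustration of the change, here is how the two launches differ, assuming a standard `llama-server` invocation with all non-expert layers offloaded to the GPU (`-ngl 99`); the model filename is a placeholder, not the exact file used in the post:

```bash
# Before: --cpu-moe keeps every MoE expert weight on the CPU,
# so most of the 16 GB of VRAM sits idle.
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 --cpu-moe

# After: --n-cpu-moe 20 keeps only the expert weights of the first 20
# (of 40) MoE layers on the CPU and offloads the rest to the GPU.
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 --n-cpu-moe 20
```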
The result was a jump from 51.2 to 79.3 tokens per second for generation, with prompt processing also increasing by 54%. Furthermore, by adding the `-np 1` flag to optimize for a single user, the setup could handle a 128,000-token context window with only a minimal VRAM overhead of 1.36 GB. The benchmark was run on consumer-grade hardware: an NVIDIA RTX 5070 Ti GPU and an AMD Ryzen 9800X3D CPU. Notably, the entire tuning process was orchestrated autonomously by Anthropic's Claude Opus 4.7, which configured the server, ran benchmarks, and iterated on the settings based on the hardware specs.
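A minimal sketch of the full single-user configuration described above, again with a placeholder model path; `-c 131072` requests the 128K-token context window and `-np 1` limits the server to one parallel slot so the context is not split across concurrent requests:

```bash
# Single-user, long-context launch: everything on the GPU except the
# first 20 layers' MoE experts, one slot, 128K context.
llama-server -m Qwen3.6-35B-A3B.gguf \
  -ngl 99 \
  --n-cpu-moe 20 \
  -c 131072 \
  -np 1
```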
This discovery provides a clear tuning guide for users with different GPU memory capacities. For a 40-layer MoE model, the recommended `N` value is 26 for a 12GB GPU, 20 for 16GB, and 8 for 24GB: the more VRAM available, the fewer expert layers need to stay on the CPU. The post serves as a practical blueprint for enthusiasts and developers looking to maximize the performance of state-of-the-art MoE models on their own hardware, moving beyond one-size-fits-all configuration advice.
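As a sketch of how those recommendations map onto the command above (the `N` values come from the post; the optimal split will still vary with quantization and context length):

```bash
# Only the --n-cpu-moe value changes per VRAM tier;
# larger N keeps more expert layers on the CPU.
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 --n-cpu-moe 26   # 12 GB GPU
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 --n-cpu-moe 20   # 16 GB GPU (benchmarked setup)
llama-server -m Qwen3.6-35B-A3B.gguf -ngl 99 --n-cpu-moe 8    # 24 GB GPU
```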
- Replacing `--cpu-moe` with `--n-cpu-moe 20` in llama.cpp yielded a 54% speed increase, from 51.2 to 79.3 tokens/sec for Qwen3.6-35B.
- The optimization properly utilizes GPU VRAM, increasing usage from 3.5GB to 12.7GB on a 16GB RTX 5070 Ti and enabling a 128K context window.
- Claude Opus 4.7 autonomously handled the entire hardware tuning loop, from config building to benchmark iteration, showcasing advanced AI agent capabilities.
Why It Matters
Enables significantly faster and more capable local AI inference on consumer hardware, democratizing access to powerful 35B+ parameter models.