Run Qwen3.5's flagship 397-billion-parameter model at 5–9 tok/s on a $2,100 desktop! Two $500 GPUs, 32 GB of RAM, and one NVMe drive, using Q4_K_M quantization.
The new system achieves a 60% VRAM hit rate, cutting NVMe reads to just 7% for practical MoE inference.
A breakthrough system called FOMOE (Fast Opportunistic Mixture of Experts) makes running massive 397-billion-parameter models like Qwen3.5 practical on consumer hardware. The core innovation addresses the fundamental problem of Mixture-of-Experts (MoE) inference: because which experts a token will activate cannot be predicted in advance, the hundreds of gigabytes of weights typically demand an expensive, high-bandwidth memory system. FOMOE's architecture instead stores the most frequently used experts in GPU VRAM, maintains a rolling cache, and uses a dual-GPU setup to overlap weight loading with computation, achieving a 60% VRAM hit rate.
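The tiered caching described above can be sketched as a simple residence table per expert. This is a hypothetical illustration, not FOMOE's actual API: the struct names, the promotion policy, and the counters are all assumptions standing in for the real rolling-cache logic.

```c
#include <assert.h>

#define N_EXPERTS 128  /* illustrative expert count, not Qwen3.5's */

typedef enum { LOC_VRAM, LOC_RAM, LOC_NVME } tier_t;

typedef struct {
    tier_t tier[N_EXPERTS];  /* current residence of each expert's weights */
    int hits_vram, hits_ram, reads_nvme;
} expert_cache;

/* Resolve one expert: record which tier its weights were found in, then
   promote it one tier toward VRAM (a stand-in for the rolling-cache
   policy; a real implementation would also evict to make room). */
tier_t cache_fetch(expert_cache *c, int expert_id) {
    tier_t where = c->tier[expert_id];
    switch (where) {
    case LOC_VRAM: c->hits_vram++;                                 break;
    case LOC_RAM:  c->hits_ram++;   c->tier[expert_id] = LOC_VRAM; break;
    case LOC_NVME: c->reads_nvme++; c->tier[expert_id] = LOC_RAM;  break;
    }
    return where;
}
```

The counters make the article's headline numbers measurable: the 60% VRAM hit rate is `hits_vram` over total fetches, and the 7% figure is the `reads_nvme` fraction.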
The system's experimental Cache-Aware Routing (CAR) feature provides the biggest performance leap. When the model needs to select an expert, CAR can choose the next-best-scoring expert that is already cached in VRAM or system RAM, provided it scores within an acceptable threshold of the top choice. This cuts NVMe reads to just 7%, enabling speeds of approximately 9 tokens per second. The trade-off is minimal: only a 3.5% increase in perplexity on WikiText benchmarks, making this a viable approach for practical applications.
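The CAR selection rule can be sketched in a few lines of C. Everything here is an assumption for illustration: the function name, the score representation, and the threshold semantics (accept any cached expert whose router score is within `threshold` of the true top score) are not taken from FOMOE's source.

```c
#include <assert.h>

typedef enum { LOC_VRAM, LOC_RAM, LOC_NVME } expert_loc;

typedef struct {
    int        id;
    float      score;  /* router score for this expert */
    expert_loc loc;    /* where this expert's weights currently live */
} expert_t;

/* Hypothetical CAR sketch: pick the best-scoring expert, but let a
   cached (VRAM/RAM) expert win if its score is within `threshold`
   of the true top score, avoiding a slow NVMe read. */
int car_select(const expert_t *experts, int n, float threshold) {
    float top = experts[0].score;
    for (int i = 1; i < n; i++)
        if (experts[i].score > top) top = experts[i].score;

    int best = -1;
    float best_score = 0.0f;
    for (int i = 0; i < n; i++) {
        const expert_t *e = &experts[i];
        /* only experts close enough to the top score are eligible */
        if (e->score < top - threshold) continue;
        int cached      = (e->loc != LOC_NVME);
        int best_cached = (best >= 0 && experts[best].loc != LOC_NVME);
        /* prefer cached experts; among equals, take the higher score */
        if (best < 0 || (cached && !best_cached) ||
            (cached == best_cached && e->score > best_score)) {
            best = i;
            best_score = e->score;
        }
    }
    return experts[best].id;  /* the top expert is always eligible */
}
```

The threshold is the knob behind the reported trade-off: widening it raises the cache hit rate (fewer NVMe reads, more speed) at the cost of routing quality, which is where the 3.5% perplexity increase comes from.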
Built with approximately 15,000 lines of Claude-assisted C/HIP code, FOMOE demonstrates that with clever caching and routing algorithms, the industry's largest models can become accessible without requiring data center-scale infrastructure. This development could significantly lower the barrier to experimenting with and deploying state-of-the-art MoE architectures.
- Enables Qwen3.5's 397B MoE model to run at 5–9 tok/s on a $2,100 desktop with dual $500 GPUs
- Cache-Aware Routing (CAR) cuts NVMe reads to 7% by selecting cached experts, with only 3.5% perplexity penalty
- Uses dual-GPU ping-pong architecture and rolling expert cache to achieve 60% VRAM hit rate
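The dual-GPU ping-pong mentioned above boils down to a schedule in which the two GPUs swap roles each layer, so weight prefetch for layer i+1 overlaps compute for layer i. The sketch below only records that schedule; the function and struct names are illustrative, and in the real system the two roles would run concurrently on separate HIP streams.

```c
#include <assert.h>

#define N_LAYERS 4  /* illustrative layer count */

typedef struct {
    int compute[N_LAYERS];   /* which GPU computes each layer */
    int prefetch[N_LAYERS];  /* which GPU prefetches the next layer's
                                experts during that time (-1 = none) */
} trace_t;

/* Hypothetical ping-pong schedule: GPUs 0 and 1 alternate roles each
   layer, hiding expert-load latency behind compute. */
void run_pingpong(trace_t *t) {
    for (int layer = 0; layer < N_LAYERS; layer++) {
        int cur = layer & 1;  /* GPU computing this layer */
        int nxt = cur ^ 1;    /* GPU free to prefetch the next layer */
        t->compute[layer]  = cur;
        t->prefetch[layer] = (layer + 1 < N_LAYERS) ? nxt : -1;
    }
}
```

The point of the alternation is that the GPU doing the prefetch is otherwise idle, so the NVMe/RAM-to-VRAM copies cost no compute time as long as they finish before the roles swap.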
Why It Matters
Dramatically lowers the cost and hardware barrier for running cutting-edge, massive-scale AI models locally for research and development.