Open Source

Run the Qwen3.5 flagship model, with 397 billion parameters, at 5–9 tok/s on a $2,100 desktop! Two $500 GPUs, 32 GB RAM, and one NVMe drive, using Q4_K_M quants.

New system achieves 60% VRAM hit rate, cutting NVMe reads to 7% for practical MoE inference.

Deep Dive

A breakthrough system called FOMOE (Fast Opportunistic Mixture of Experts) makes running massive 397-billion-parameter models like Qwen3.5 practical on consumer hardware. The core innovation solves the fundamental problem with Mixture of Experts (MoE) models: their hundreds of gigabytes of weights typically require expensive, high-bandwidth memory systems because you can't predict which expert weights will be needed next. FOMOE's architecture stores common experts in GPU VRAM, maintains a rolling cache, and uses a dual-GPU setup to overlap weight loading with computation, achieving a 60% VRAM hit rate.
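The write-up doesn't publish FOMOE's actual data structures, but a minimal C sketch of the tiered-cache idea might look like the following. The names (expert_cache_t, locate_expert), the slot count, and the LRU policy are illustrative assumptions, not the project's real API: a fixed set of VRAM slots acts as a rolling cache, and a miss falls through to system RAM and then NVMe while the other GPU keeps computing.

```c
/* Hypothetical sketch of a tiered expert cache: experts live in VRAM,
 * system RAM, or on NVMe, and a rolling LRU policy keeps the most
 * recently used experts resident. Names and sizes are illustrative. */
#include <stdint.h>

typedef enum { TIER_VRAM, TIER_RAM, TIER_NVME } tier_t;

typedef struct {
    int32_t  expert_id;  /* -1 marks an empty slot */
    uint64_t last_used;  /* logical timestamp for LRU eviction */
} slot_t;

#define VRAM_SLOTS 64    /* how many experts fit in GPU memory (assumed) */

typedef struct {
    slot_t   vram[VRAM_SLOTS];
    uint64_t clock;
} expert_cache_t;

/* Returns the tier the expert currently resides in; on a VRAM miss the
 * caller would overlap the RAM/NVMe copy with computation on the other GPU. */
tier_t locate_expert(expert_cache_t *c, int32_t expert_id, int in_ram)
{
    c->clock++;
    for (int i = 0; i < VRAM_SLOTS; i++) {
        if (c->vram[i].expert_id == expert_id) {
            c->vram[i].last_used = c->clock;  /* refresh LRU position */
            return TIER_VRAM;                 /* ~60% of lookups end here */
        }
    }
    /* Miss: evict the least recently used slot and schedule a copy-in. */
    int victim = 0;
    for (int i = 1; i < VRAM_SLOTS; i++)
        if (c->vram[i].last_used < c->vram[victim].last_used)
            victim = i;
    c->vram[victim].expert_id = expert_id;
    c->vram[victim].last_used = c->clock;
    return in_ram ? TIER_RAM : TIER_NVME;     /* NVMe read is the slow path */
}
```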

The system's experimental Cache-Aware Routing (CAR) feature provides the biggest performance leap. When the model needs to select an expert, CAR can choose the next-best scoring expert that's already cached in VRAM or system RAM, provided it scores within an acceptable threshold. This cuts NVMe reads down to just 7%, enabling speeds of approximately 9 tokens per second. The trade-off is minimal—only a 3.5% increase in perplexity on wikitext benchmarks—making this a viable approach for practical applications.
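To make the routing trade-off concrete, here is a hedged C sketch of what cache-aware expert selection could look like. The function route_expert, the is_cached helper, and the additive score threshold are assumptions for illustration; FOMOE's actual acceptance criterion and interfaces may differ.

```c
/* Hypothetical cache-aware routing: if the top-scoring expert is not
 * resident, substitute the best cached expert whose router score is
 * within a tolerance of the top score; otherwise pay the NVMe read. */

/* Would query the VRAM/RAM tiers sketched above (assumed helper). */
extern int is_cached(int expert_id);

int route_expert(const float *scores, int num_experts, float threshold)
{
    /* Find the highest-scoring expert regardless of residency. */
    int best = 0;
    for (int e = 1; e < num_experts; e++)
        if (scores[e] > scores[best])
            best = e;

    if (is_cached(best))
        return best;                 /* no substitution needed */

    /* Take the best cached expert that stays within the score threshold;
     * this substitution is what pushes NVMe reads down toward ~7%. */
    int fallback = -1;
    for (int e = 0; e < num_experts; e++) {
        if (!is_cached(e))
            continue;
        if (scores[e] >= scores[best] - threshold &&
            (fallback < 0 || scores[e] > scores[fallback]))
            fallback = e;
    }
    return fallback >= 0 ? fallback : best;  /* fall back to the NVMe read */
}
```

A tighter threshold keeps routing closer to the model's original choices (lower perplexity cost) at the price of more NVMe reads; a looser one trades a small quality hit for speed, which is the knob behind the reported 3.5% perplexity increase.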

Built with approximately 15,000 lines of Claude-assisted C/HIP code, FOMOE demonstrates that with clever caching and routing algorithms, the industry's largest models can become accessible without requiring data center-scale infrastructure. This development could significantly lower the barrier to experimenting with and deploying state-of-the-art MoE architectures.

Key Points
  • Enables Qwen3.5's 397B MoE model to run at 5-9 tok/s on a $2,100 desktop with dual $500 GPUs
  • Cache-Aware Routing (CAR) cuts NVMe reads to 7% by selecting cached experts, with only 3.5% perplexity penalty
  • Uses dual-GPU ping-pong architecture and rolling expert cache to achieve 60% VRAM hit rate

Why It Matters

Dramatically lowers the cost and hardware barrier for running cutting-edge, massive-scale AI models locally for research and development.