Open Source

Used ray tracing cores on my RTX 5070 Ti for LLM routing — 218x speedup, runs entirely on 1 consumer GPU

Using idle ray tracing hardware for MoE routing delivers massive speedups and challenges common assumptions about how experts specialize.

Deep Dive

An independent researcher has unlocked a massive performance boost for Mixture-of-Experts models by repurposing idle hardware. Jordi Silvestre's project, Spectral-AI, cleverly uses the RT Cores on NVIDIA consumer GPUs—dedicated silicon normally reserved for ray tracing in games—to handle the complex routing decisions in MoE architectures. On an RTX 5070 Ti, this hack accelerated routing by 218 times for large batches and slashed VRAM usage for that task by 731-fold, all while adding a minimal 1.5% perplexity penalty. This breakthrough could make running sophisticated, multi-expert models far more efficient on standard hardware.
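In the standard MoE formulation, the routing step being offloaded here is a gating matmul followed by a top-k selection over expert logits. The sketch below shows that baseline operation in plain NumPy; it is an illustrative reference for what the router computes, not Spectral-AI's RT-Core implementation, and all names are hypothetical.

```python
import numpy as np

def topk_softmax_routing(hidden, gate_w, k=2):
    """Baseline MoE routing: a dense matmul over expert logits,
    then top-k selection with softmax weights over the selected experts.
    This is the per-token decision that Spectral-AI reportedly offloads
    to RT Cores (illustrative sketch, not the project's actual code)."""
    logits = hidden @ gate_w                      # (tokens, experts)
    topk = np.argpartition(logits, -k, axis=1)[:, -k:]  # indices of k largest
    selected = np.take_along_axis(logits, topk, axis=1)
    # Softmax only over the selected experts, as in standard top-k gating.
    w = np.exp(selected - selected.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return topk, w
```

At large batch sizes this matmul-plus-selection is exactly the kind of regular, per-token workload where moving it off the main compute units can pay for itself.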

The research yielded a second, paradigm-challenging discovery. By testing across models like OLMoE, Qwen-MoE, and DeepSeek-MoE, Silvestre found that experts within these systems don't specialize by subject matter (like a 'science expert'), as commonly assumed. Instead, they organize by syntactic function, handling categories like content words, function words, and punctuation. The fully open-sourced code and papers provide a roadmap for others to build on this efficient, hardware-aware approach to AI inference.
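The syntactic-specialization finding can be checked with a simple co-occurrence analysis: tally which syntactic category each expert most often receives. The toy function below illustrates that kind of tally; it is a hypothetical sketch of the analysis style, not the paper's actual methodology.

```python
from collections import Counter, defaultdict

def dominant_category_per_expert(categories, expert_ids):
    """Given parallel lists of per-token syntactic categories
    (e.g. 'content', 'function', 'punct') and the expert each token
    was routed to, return each expert's most frequent category.
    Toy analysis for illustration only."""
    profile = defaultdict(Counter)
    for cat, expert in zip(categories, expert_ids):
        profile[expert][cat] += 1
    return {e: counts.most_common(1)[0][0] for e, counts in profile.items()}
```

If experts specialized by topic rather than syntax, this tally would show no clean category dominance; the reported result is that categories like punctuation and function words map strongly onto particular experts.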

Key Points
  • Repurposes idle NVIDIA RT Cores for a 218x routing speedup on an RTX 5070 Ti.
  • Reduces VRAM usage for MoE routing by 731x with only a 1.5% perplexity increase.
  • Challenges the 'topic expert' myth, showing MoE experts specialize by syntactic type.

Why It Matters

Enables running complex MoE models efficiently on consumer hardware, potentially democratizing advanced AI inference.