Open Source

Old GTX 1080 runs 30B MoE models at 24 tok/s

r/LocalLLaMA May 14, 2026

⚡MoE offloading and TurboQuant make 8GB VRAM run 30B models

Deep Dive

A Reddit user (mdda) has shown that large Mixture-of-Experts (MoE) models from Qwen and Gemma can run at usable speeds on a decade-old GTX 1080 with just 8GB of VRAM. The setup, a secondhand machine costing $200 (i7-6700, 32GB RAM), achieved ~24 tok/s on Qwen 3.6 35B-A3B and ~20-24.5 tok/s on Gemma 4 26B-A4B, both with a full 128k-token context window. The key enabler is the TurboQuant/RotorQuant KV cache quantization in the llama.cpp fork used, which compresses the cache enough to fit within 8GB of VRAM even at that context length.
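To see why KV cache quantization matters at 128k context, a rough sizing sketch helps. The formula below is the standard one for full attention (keys plus values, per layer, per KV head). The layer count, head count, head dimension, and bit widths are hypothetical placeholders, not the real Qwen 3.6 or Gemma 4 configurations, and the estimate ignores quantization scale overhead and any sliding-window or hybrid attention layers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bits_per_value: float) -> float:
    """Approximate KV cache size for full attention.

    Stores one key and one value vector per layer, per KV head, per token.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return values * bits_per_value / 8


GIB = 2 ** 30

# Hypothetical model shape, chosen only for illustration.
shape = dict(n_layers=48, n_kv_heads=4, head_dim=128, n_ctx=128 * 1024)

for bits in (16, 8, 4):  # assumed bit widths, not TurboQuant's actual settings
    size = kv_cache_bytes(bits_per_value=bits, **shape)
    print(f"{bits:>2}-bit KV cache: {size / GIB:.1f} GiB")
```

With these placeholder values, a 16-bit cache alone would exceed 8GB while a 4-bit cache would be a quarter of that size. The actual savings depend on the real model shapes and on the bit width TurboQuant uses.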

The real innovation is MoE offloading: llama.cpp keeps the hot transformer layers and the KV cache on the GPU while parking cold expert weights in system RAM. During inference, those weights are streamed to the GPU over PCIe 3.0 x16 on demand. The user found that the system is entirely PCIe bandwidth-limited: GPU utilization sits at 40-50% while the PCIe link is saturated. Gemma 4's Multi-Token Prediction (MTP) mode needed an important fix. By default, the token embedding table stays on the CPU, which cripples MTP speculative decoding. Manually forcing the embedding table onto the GPU with --override-tensor-draft raised the MTP speedup from ~5% to ~22%, with a 79% draft acceptance rate.
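The PCIe-bound claim can be sanity-checked with simple arithmetic. The sketch below computes the theoretical peak of a PCIe 3.0 x16 link (8 GT/s per lane, 128b/130b encoding) and the largest amount of data that could cross the link per generated token at a given speed. It assumes the link is the only bottleneck and runs at its theoretical peak; real transfers are slower, so the true per-token budget is smaller. The function names are illustrative and not part of llama.cpp.

```python
def pcie3_bytes_per_second(lanes: int = 16) -> float:
    """Theoretical one-direction PCIe 3.0 bandwidth in bytes per second."""
    transfers_per_s = 8e9          # 8 GT/s per lane
    encoding = 128 / 130           # 128b/130b line encoding overhead
    return transfers_per_s * encoding / 8 * lanes


def max_bytes_per_token(tokens_per_s: float, lanes: int = 16) -> float:
    """Upper bound on bytes streamed per token if PCIe is the sole limit."""
    return pcie3_bytes_per_second(lanes) / tokens_per_s


if __name__ == "__main__":
    peak = pcie3_bytes_per_second()
    print(f"PCIe 3.0 x16 peak: {peak / 1e9:.2f} GB/s")
    for tok_s in (20.0, 24.0):  # speeds reported in the post
        budget = max_bytes_per_token(tok_s)
        print(f"{tok_s:.0f} tok/s -> at most {budget / 1e6:.0f} MB per token")
```

At the theoretical peak of roughly 15.75 GB/s, 24 tok/s leaves at most about 0.66 GB of expert weights streamed per token, which is why keeping as much as possible resident on the GPU pays off.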

Getting the system running on Fedora 42 required several workarounds: pinning the NVIDIA driver to the legacy 580xx branch for Pascal GPUs, using gcc-14 for CUDA 12.9 compatibility, and patching CUDA's math_functions.h for glibc 2.41. The user relied on the AtomicBot-ai/atomic-llama-cpp-turboquant fork for TurboQuant and MTP support. The original post includes a full guide and a GitHub link.

Key Points

MoE offloading streams cold expert weights over PCIe 3.0, keeping hot layers + KV cache on GPU
MTP speculative decoding only gains ~5% until the token embedding table is forced onto GPU, then reaches ~22% speedup
Setup required gcc-14, legacy NVIDIA drivers, and patched CUDA for glibc compatibility on Fedora 42
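The MTP figures in the key points can be put in context with a simple speculative-decoding model. The sketch assumes each drafted token is accepted independently with the same probability, and that generation stops at the first rejection, so a verification step yields 1 + a + a^2 + ... + a^k tokens on average for acceptance rate a and draft length k. The draft length of 1 is an assumption; the post does not state it.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Average tokens produced per verification step.

    Assumes independent, identical acceptance per drafted token; the step
    always yields at least one token from the main model.
    """
    return sum(accept_rate ** i for i in range(draft_len + 1))


def implied_step_cost(accept_rate: float, draft_len: int,
                      observed_speedup: float) -> float:
    """Relative cost of one draft-and-verify step vs. a plain decode step.

    Derived from: speedup = tokens_per_step / step_cost.
    """
    return expected_tokens_per_step(accept_rate, draft_len) / observed_speedup


if __name__ == "__main__":
    a, k = 0.79, 1  # acceptance rate from the post; draft length assumed
    print(f"tokens per step: {expected_tokens_per_step(a, k):.2f}")
    for speedup in (1.05, 1.22):  # ~5% and ~22% speedups from the post
        cost = implied_step_cost(a, k, speedup)
        print(f"speedup {speedup:.2f} -> step cost {cost:.2f}x baseline")
```

Under these assumptions, a large gap between tokens per step and the observed speedup points to overhead in each draft-and-verify step, which is consistent with the post's finding that moving the embedding table onto the GPU recovered most of the gain.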

Why It Matters

The result makes advanced MoE models usable on budget 2016-era hardware, putting local AI inference within reach of hobbyists.
