Open Source

Old GTX 1080 runs 30B MoE models at 24 tok/s

MoE offloading and TurboQuant make 8GB VRAM run 30B models

Deep Dive

A Reddit user (mdda) has shown that large Mixture-of-Experts (MoE) models from Qwen and Gemma can run at usable speeds on a decade-old GTX 1080 with just 8GB of VRAM. The setup, a secondhand machine costing $200 (i7-6700, 32GB RAM), reached ~24 tok/s on Qwen 3.6 35B-A3B and ~20-24.5 tok/s on Gemma 4 26B-A4B, both with a full 128k-token context window. The key enabler is llama.cpp's TurboQuant/RotorQuant KV cache quantization, which compresses the cache enough to fit within 8GB of VRAM even at that context length.
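To see why cache quantization matters at 128k tokens, here is a rough sizing sketch. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen or Gemma configurations, and the bit widths are illustrative rather than TurboQuant's actual storage format. The formula also ignores per-block scale overhead and any layers that do not keep a full-length cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Bytes needed to store keys and values for n_ctx tokens."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = one K and one V
    return values * bits_per_value / 8

# Hypothetical architecture, NOT the real Qwen or Gemma configuration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128
N_CTX = 128 * 1024  # 128k-token context window

GIB = 1024 ** 3
for label, bits in [("f16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, bits)
    print(f"{label:>5}: {size / GIB:.1f} GiB")
```

Under these assumed dimensions, an unquantized f16 cache alone would need 12 GiB, more than the card holds, while a 4-bit cache would need about 3 GiB and leave room for the layers kept on the GPU.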

The real innovation is MoE offloading: llama.cpp keeps the hot transformer layers and the KV cache on the GPU and parks cold expert weights in system RAM. During inference, those weights are streamed to the GPU over PCIe 3.0 x16 on demand. The user found the system to be entirely PCIe bandwidth-limited: GPU utilization sits at 40-50% while the PCIe link is maxed out. Gemma 4's Multi-Token Prediction (MTP) mode needed one important fix. By default, the token embedding table stays on the CPU, which cripples MTP speculative decoding. Forcing the embedding table onto the GPU with --override-tensor-draft raised the MTP speedup from ~5% to ~22%, with a 79% draft acceptance rate.
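The bandwidth limit described above implies a simple budget: if the PCIe link is saturated, the data streamed per token cannot exceed link bandwidth divided by token rate. The sketch below works that out from the PCIe 3.0 signalling rate (8 GT/s per lane, 128b/130b encoding) and the reported token rates. It uses the theoretical link rate, so real transfers would be somewhat lower, and the function names are made up for illustration.

```python
def pcie3_bandwidth_gb_s(lanes):
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s (10^9 bytes)."""
    transfers_per_s = 8e9   # 8 GT/s per lane
    encoding = 128 / 130    # 128b/130b line coding
    bits_per_s = transfers_per_s * encoding * lanes
    return bits_per_s / 8 / 1e9

def max_gb_per_token(bandwidth_gb_s, tokens_per_s):
    """Upper bound on data streamed per generated token on a saturated link."""
    return bandwidth_gb_s / tokens_per_s

bw = pcie3_bandwidth_gb_s(16)
print(f"PCIe 3.0 x16: {bw:.2f} GB/s")
for tps in (20.0, 24.0):  # token rates reported in the post
    print(f"{tps:.0f} tok/s -> at most {max_gb_per_token(bw, tps):.2f} GB per token")
```

At about 15.75 GB/s, 24 tok/s leaves room for roughly 0.66 GB of expert weights per token, which is why the layers used on every token have to stay resident on the GPU.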

Setting up the system required several workarounds on Fedora 42: pinning NVIDIA drivers to the legacy 580xx branch for Pascal GPUs, using gcc-14 for CUDA 12.9 compatibility, and patching CUDA's math_functions.h for glibc 2.41. The user ran the AtomicBot-ai/atomic-llama-cpp-turboquant fork for its TurboQuant and MTP support, and provided a full guide and a GitHub link.
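For orientation, a launch command along these lines might be assembled as follows. This is a sketch, not the user's actual invocation: -m, -c, -ngl, and --n-cpu-moe are flags from mainline llama.cpp and the fork may differ, while the model path, the layer count of 40, and the tensor pattern passed to --override-tensor-draft are assumptions. The KV cache type flags are left out because the post summary does not give the fork's TurboQuant type names.

```python
import shlex

def build_server_command(model_path, n_ctx=131072, n_cpu_moe_layers=40,
                         draft_override=r"token_embd\.weight=CUDA0"):
    """Assemble a llama-server argv list; nothing is executed here."""
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(n_ctx),                           # 128k-token context
        "-ngl", "99",                               # place all layers on the GPU...
        "--n-cpu-moe", str(n_cpu_moe_layers),       # ...except expert weights of the first N layers (assumed N)
        "--override-tensor-draft", draft_override,  # assumed pattern: embedding table on GPU for drafting
    ]

cmd = build_server_command("models/example-moe.gguf")  # hypothetical path
print(shlex.join(cmd))
```

Building the argument list in code keeps the tuning knobs (context size, number of layers whose experts stay in RAM) in one place, which helps when searching for the largest GPU-resident share that still fits in 8GB.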

Key Points
  • MoE offloading streams cold expert weights over PCIe 3.0, keeping hot layers + KV cache on GPU
  • MTP speculative decoding only gains ~5% until the token embedding table is forced onto GPU, then reaches ~22% speedup
  • Setup required gcc-14, legacy NVIDIA drivers, and patched CUDA for glibc compatibility on Fedora 42

Why It Matters

A $200 secondhand machine with a 2016 GPU can run current MoE models at a full 128k context, putting local AI inference within reach of hobbyists on a budget.