Open Source

Qwen3.6-35B-A3B hits 30+ tps on 8GB 3070 Ti with 262k context

A single 8GB GPU runs a 35B MoE model at 262k context, with 1M within reach, and no cloud needed

Deep Dive

A Reddit user has pushed the boundaries of local LLM inference by running the Qwen3.6-35B-A3B model (a 35B-parameter MoE with roughly 3B parameters active per token, as the A3B suffix indicates) on an 8GB RTX 3070 Ti. Using llama.cpp with APEX-I-Quality or Q4_K_XL quants, they achieve 30+ tokens per second (tps) at 262k context. Because the model is sparse, only the active expert layers (~3GB), the GPU buffers, and the KV cache need to fit in VRAM. That leaves headroom for contexts up to 1M (with IQ4_NL_XL and turbo4 KV quant), though performance drops noticeably beyond 150k.
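The VRAM budget described above can be sketched as a back-of-the-envelope calculation. This is a rough illustration only: the formula is the generic one for standard attention layers, and the layer count, KV head count, head dimension, buffer size, and bits per element used below are placeholder assumptions, not the model's published configuration or the user's measurements.

```python
def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, bits_per_element):
    """Approximate KV cache size for standard attention layers.

    Each token stores one K and one V vector per attention layer,
    hence the factor of 2.
    """
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8


def fits_in_vram(vram_gb, resident_weights_gb, buffers_gb, kv_bytes):
    """True if resident weights + compute buffers + KV cache fit the budget."""
    kv_gb = kv_bytes / 1024**3
    return resident_weights_gb + buffers_gb + kv_gb <= vram_gb


if __name__ == "__main__":
    # All architecture values here are hypothetical placeholders.
    for n_ctx in (262_144, 1_048_576):
        kv = kv_cache_bytes(
            n_ctx=n_ctx,
            n_attn_layers=10,      # assumed
            n_kv_heads=2,          # assumed
            head_dim=128,          # assumed
            bits_per_element=4.5,  # assumed ~4-bit quantized cache
        )
        ok = fits_in_vram(vram_gb=8, resident_weights_gb=3, buffers_gb=1, kv_bytes=kv)
        print(f"ctx={n_ctx}: KV cache {kv / 1024**3:.2f} GiB, fits in 8GB: {ok}")
```

The point of the sketch is the shape of the trade-off: the KV cache grows linearly with context length, so a smaller cache element (a quantized KV cache) is what stretches a fixed 8GB budget toward longer contexts.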

Switching from Windows 11 to Ubuntu Server (i3wm, no GPU compositor) boosted throughput by at least ~25%: from under 27 tps (dropping further at high context) to a stable 34-37 tps. System RAM usage fell from 28GB+ to ~22GB, freeing about 6GB for other tasks. The user warns against forcing all layers into VRAM or piling on extra runtime flags, either of which can exhaust memory. They plan to add a secondary low-end GPU to drive the OS display so the 3070 Ti can be fully dedicated to inference.
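As a quick sanity check on those figures, the arithmetic can be worked through in a few lines. The only assumption is that 27 tps is taken as the Windows baseline (the report says "under 27"), so the computed gains are lower bounds.

```python
def pct_gain(before, after):
    """Percentage increase going from `before` to `after`."""
    return (after - before) / before * 100


windows_tps = 27.0  # reported as "under 27 tps", so treat as an upper bound
for ubuntu_tps in (34.0, 37.0):
    gain = pct_gain(windows_tps, ubuntu_tps)
    print(f"{windows_tps:.0f} -> {ubuntu_tps:.0f} tps: +{gain:.0f}%")

# System RAM: 28GB+ on Windows 11 vs ~22GB on Ubuntu Server
ram_freed_gb = 28 - 22
print(f"RAM freed: about {ram_freed_gb}GB")
```

With these inputs the gain works out to roughly 26% at 34 tps and 37% at 37 tps, so "~25%" is the conservative end of the range, and the RAM saving is about 6GB.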

Key Points
  • Qwen3.6-35B-A3B achieves 30+ tps at 262k context on an 8GB RTX 3070 Ti with Q4_K_XL quant
  • Ubuntu Server provides a 25% tps boost over Windows 11 and uses 6GB less system RAM
  • Only ~3B parameters are active per token, so the VRAM-resident expert layers stay around 3GB, enabling up to 1M context with IQ4_NL_XL and turbo4 KV quant

Why It Matters

This report suggests that large-context MoE models are viable on consumer GPUs, making long-context local inference accessible without cloud costs.