Open Source

Two old RTX 2080 Ti cards push Qwen3.6 27B to 38 tokens/s

A sub-$1K setup hits 38 tokens/s on two 22GB cards, each power-limited to 150W.

Deep Dive

A Reddit user (snapo84) shared a custom setup that runs the Qwen3.6 27B model on two heavily modified RTX 2080 Ti GPUs. Each card originally had 11GB of VRAM but was upgraded in China to 22GB, giving a combined 44GB for inference. Using llama.cpp with IQ4_XS quantization and an f16 key-value (KV) cache, the rig reaches 38 tokens per second with both cards power-limited to 150W to keep noise low. The critical optimization was switching to --split-mode tensor, which raised throughput from 14 tokens/s to 38 tokens/s because, according to the post, the cards are compute-bound rather than bandwidth-bound.

The user also recommends --fit on, which lets llama.cpp manage context length automatically instead of relying on a manual limit set close to VRAM capacity; this slightly improved token generation speed. They note that a q8_0 KV cache can cause the model to loop during long coding sessions, so f16 is preferred. Total power draw at the wall is 400W, and the entire rig cost under $1,000 (excluding the mods). The setup works well with Hermes and Opencode, and the user provided their exact Docker Compose configuration and a chat template reference.
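
The settings described above can be sketched as a small Python helper that assembles the command lines. This is an illustration only, not the user's actual Docker Compose file, which is not reproduced here. The model path is a hypothetical placeholder, the flags are the ones named in the post plus the standard llama.cpp KV cache type options, and a real launch may need further arguments.

```python
import shlex

# Hypothetical path: the post does not give the exact GGUF filename.
MODEL_PATH = "/models/qwen3.6-27b-iq4_xs.gguf"


def power_limit_cmd(gpu_index, watts=150):
    """Build the nvidia-smi call that caps one GPU's power draw (needs root)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def llama_server_cmd(model_path=MODEL_PATH):
    """Assemble llama-server arguments from the flags named in the post."""
    return [
        "llama-server",
        "--model", model_path,
        "--split-mode", "tensor",  # the change reported to lift 14 -> 38 tokens/s
        "--cache-type-k", "f16",   # f16 KV cache; q8_0 reportedly caused loops
        "--cache-type-v", "f16",
        "--fit", "on",             # let llama.cpp size the context to fit VRAM
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    for gpu in (0, 1):
        print(shlex.join(power_limit_cmd(gpu)))
    print(shlex.join(llama_server_cmd()))
```

Printing rather than executing keeps the sketch usable on a machine without the GPUs; the lists could be passed to subprocess.run once the values are checked against the installed llama.cpp build.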

Key Points
  • Two modded RTX 2080 Ti cards with 22GB VRAM each run Qwen3.6 27B at 38 tokens/s (IQ4_XS, f16 KV cache).
  • Switching to --split-mode tensor gave the biggest boost, from 14 to 38 tokens/s, with each card power-limited to 150W.
  • The entire setup costs under $1K (excluding the VRAM mods) and draws 400W at the wall; the f16 KV cache avoids the looping seen with q8_0 during long coding sessions.
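
The reported figures allow a quick back-of-the-envelope check. The sketch below uses only numbers from the post and assumes, which the post does not state, that the 400W wall reading was taken while generating at 38 tokens/s.

```python
# Figures reported in the post.
BASELINE_TPS = 14.0      # tokens/s before switching the split mode
TENSOR_TPS = 38.0        # tokens/s with --split-mode tensor
WALL_WATTS = 400.0       # total draw at the wall
GPU_LIMIT_WATTS = 150.0  # power limit per card
NUM_GPUS = 2


def speedup(before_tps, after_tps):
    """Ratio of the new throughput to the old one."""
    return after_tps / before_tps


def joules_per_token(watts, tokens_per_second):
    """Energy per generated token: watts are joules per second."""
    return watts / tokens_per_second


if __name__ == "__main__":
    gpu_budget = GPU_LIMIT_WATTS * NUM_GPUS
    print(f"speedup: {speedup(BASELINE_TPS, TENSOR_TPS):.2f}x")
    print(f"energy per token: {joules_per_token(WALL_WATTS, TENSOR_TPS):.1f} J")
    print(f"GPU power budget: {gpu_budget:.0f} W of {WALL_WATTS:.0f} W at the wall")
```

Under that assumption the split-mode change is about a 2.7x speedup, and the rig spends roughly 10.5 joules per generated token, with the two GPUs capped at 300W of the 400W total.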

Why It Matters

It shows that older, budget hardware can run a 27B model at usable speeds with aggressive quantization and tensor splitting.