Open Source

Dual 3090s run Qwen 27B at 128k context: skip VRAM hacks

Running Qwen3.6-27B on two used 3090s yields 1399 tokens per second of prompt processing, a case for buying enough VRAM outright.

Deep Dive

In a viral Reddit post, user MotokoAGI describes running the dense Qwen3.6-27B model on a dual RTX 3090 setup. With Q8-quantized weights and an f16 key/value cache, they report a 128K-token context, 1399 tokens per second of prompt processing (pp), and 104 tokens per second of generation (tg). Their takeaway: the usual hacks for squeezing large models into limited VRAM (offloading to system RAM, shared memory, or fragmentation tricks) are not worth the hassle. They explicitly recommend acquiring enough GPU memory up front, even if that means buying older cards such as AMD MI50s or NVIDIA P40s.
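
To put the two reported rates in practical terms, here is a rough latency sketch. It assumes both rates stay constant regardless of how full the context is, which real runs typically do not, and that "128K" means 131,072 tokens. The 500-token reply length is a hypothetical value chosen for illustration.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to handle `tokens` at a fixed throughput."""
    return tokens / tokens_per_second


PP_RATE = 1399.0   # reported prompt-processing tokens per second
TG_RATE = 104.0    # reported generation tokens per second

prompt_tokens = 128 * 1024   # assumption: "128K" = 131,072 tokens
reply_tokens = 500           # hypothetical reply length

prefill_s = seconds_for(prompt_tokens, PP_RATE)
decode_s = seconds_for(reply_tokens, TG_RATE)

print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s")
```

At these rates, ingesting a completely full context would take on the order of a minute and a half, while the reply itself arrives in a few seconds.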

The post reflects a view common among people who run models locally: when deploying large language models on your own hardware, VRAM is the bottleneck. Working around hardware limits in software often brings instability, slow inference, and long debugging sessions. Two 3090s provide 48 GB of VRAM in total (24 GB per card), which in this report holds a 27B-parameter model at Q8 along with an f16 cache for the full 128K context. For professionals running local inference, the advice comes down to return on investment: pay for adequate GPU hardware once rather than lose time repeatedly to workarounds.
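
The memory claim can be sanity-checked with a back-of-the-envelope budget. The sketch below is not from the post and rests on several assumptions: that "Q8" refers to llama.cpp's Q8_0 format (8.5 bits per weight), that the model has exactly 27e9 parameters, and that "128K" means 131,072 tokens. The post does not give the model's layer or head counts, so `kv_bytes_per_token` is left as a generic formula for standard attention layers that you would fill in from the model's own config.

```python
GIB = 1024 ** 3

# Assumption: "Q8" means llama.cpp's Q8_0 layout, which stores 32 int8 weights
# plus one fp16 scale per block: 34 bytes / 32 weights = 8.5 bits per weight.
Q8_0_BITS_PER_WEIGHT = 8.5


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_bytes_per_token(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """K/V cache bytes per token for standard attention layers (f16 = 2 bytes)."""
    # The leading 2 covers the separate K and V tensors.
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_budget_per_token(total_vram: float, weights: float, context_tokens: int,
                        overhead: float = 0.0) -> float:
    """Bytes per token left for the cache after weights and other buffers."""
    return (total_vram - weights - overhead) / context_tokens


total_vram = 2 * 24 * GIB            # two RTX 3090s, 24 GiB each
weights = weight_bytes(27e9, Q8_0_BITS_PER_WEIGHT)  # assumption: exactly 27e9 params
context = 128 * 1024                 # assumption: "128K" = 131,072 tokens
budget = kv_budget_per_token(total_vram, weights, context)

print(f"weights: {weights / GIB:.1f} GiB")
print(f"left for cache and buffers: {(total_vram - weights) / GIB:.1f} GiB")
print(f"cache budget per token: {budget / 1024:.0f} KiB")
```

Under these assumptions the weights take about 26.7 GiB, leaving roughly 21 GiB for the cache and compute buffers. To check a specific model, compare `kv_bytes_per_token(...)` for its attention layers against the printed per-token budget. Compute buffers and the split across two cards shrink the real headroom further.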

Key Points
  • Qwen3.6-27B (dense) on 2x RTX 3090 with Q8 weights and an f16 K/V cache reaches 1399 tokens/s prompt processing (pp) and 104 tokens/s generation (tg) at 128K context
  • The user strongly discourages VRAM hacks (offloading, shared memory) and recommends buying enough GPUs instead, even used P40s or MI50s
  • The model is a 27B dense architecture, not a MoE variant, so every weight is used for each token and both weights and cache need to fit in VRAM

Why It Matters

This is first-hand advice on local LLM deployment: for stability and speed, sufficient VRAM beats complex workarounds.