Open Source

Dual RTX 3060 setup runs Qwen 3.6-27B at 43 t/s for $400

Budget dual RTX 3060 outperforms AMD 7900 XTX in AI inference stability

Deep Dive

A hobbyist AI builder has demonstrated a cost-effective setup for running large language models locally. Using two RTX 3060 GPUs (12GB each, ~$400 total) on an aging i7-4770K platform, they ran Qwen 3.6-27B, a 27B-parameter model from Alibaba, in the Q4_K_S quantized GGUF format. The dual 3060s delivered 456 tokens per second (t/s) during prompt processing and 43.26 t/s during text generation. The builder rates this above their previous experience with an AMD 7900 XTX ($900), which suffered from unstable compute performance and prefill speeds of 300-500 t/s.
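To put the quoted speeds in practical terms, the short sketch below turns them into an estimated response time and a cost per unit of generation speed. It assumes both rates stay constant for the whole request, which real runs will not guarantee; the 12k-token prompt and 500-token reply are hypothetical values chosen only for illustration.

```python
# Figures quoted in the article.
PREFILL_TPS = 456.0    # prompt processing, tokens per second
DECODE_TPS = 43.26     # text generation, tokens per second
GPU_COST_USD = 400.0   # approximate cost of the two RTX 3060s


def response_time_s(prompt_tokens: int, output_tokens: int) -> float:
    """Estimated wall-clock seconds: prefill time plus generation time.

    Assumes constant throughput in both phases (a simplification).
    """
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS


def usd_per_decode_tps() -> float:
    """GPU cost in dollars per token/s of generation speed."""
    return GPU_COST_USD / DECODE_TPS


if __name__ == "__main__":
    # Hypothetical request: 12k-token prompt, 500-token reply.
    print(f"estimated response time: {response_time_s(12_000, 500):.1f} s")
    print(f"GPU cost per t/s: ${usd_per_decode_tps():.2f}")
```

Under these assumptions, most of the wait on a long prompt comes from prefill rather than generation, which is why the prompt-processing figure matters as much as the headline 43 t/s.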

The key enabler is llama.cpp compiled with CUDA support, splitting the model across both GPUs in tensor parallel mode and using MTP (Multi-Token Prediction) speculative decoding. The builder ran each GPU at PCIe 3.0 x8 via the board's SLI slot configuration, which offers the same theoretical bandwidth as the PCIe 4.0 x4 links available on modern boards. There is a notable limitation, however: tensor parallel mode prevents use of KV cache quantization, restricting the context window to roughly 64k tokens, far below the 160k the user typically needs. Additionally, MTP with draft_n_max=2 caused VRAM out-of-memory (OOM) errors, so only single-token drafts were stable. At 12k context, performance remained consistent, showing that this budget configuration can rival or exceed much pricier platforms for local LLM inference if context requirements are modest.
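The PCIe 3.0 x8 versus PCIe 4.0 x4 equivalence follows from the per-lane signalling rates in the PCIe specifications (8 GT/s and 16 GT/s, both with 128b/130b encoding). The sketch below computes the theoretical one-direction bandwidth only; it ignores protocol overhead, so real transfers will be somewhat slower on both links.

```python
# Per-lane raw transfer rates in GT/s for each PCIe generation.
GT_PER_S = {"3.0": 8.0, "4.0": 16.0}

# PCIe 3.0 and 4.0 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gbs(gen: str, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    bits_per_second_g = GT_PER_S[gen] * ENCODING_EFFICIENCY * lanes
    return bits_per_second_g / 8  # bits -> bytes


gen3_x8 = link_bandwidth_gbs("3.0", 8)
gen4_x4 = link_bandwidth_gbs("4.0", 4)

if __name__ == "__main__":
    print(f"PCIe 3.0 x8: {gen3_x8:.2f} GB/s")
    print(f"PCIe 4.0 x4: {gen4_x4:.2f} GB/s")
```

Halving the lane count while doubling the per-lane rate leaves the product unchanged, so the two links come out identical on paper.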

Key Points
  • Dual RTX 3060 (24GB total, ~$400) delivers 43.26 t/s text generation on Qwen 3.6-27B Q4_K_S
  • Prompt processing hit 456 t/s, outperforming a $900 AMD 7900 XTX in stability and speed
  • Tensor parallel blocks KV cache quantization, capping context at ~64k; MTP limited to single-token drafts to avoid OOM
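Why KV cache quantization matters for context length can be sketched with the usual sizing formula for a transformer with grouped-query attention. Everything architecture-specific here is an assumption: the layer count, KV head count, head dimension, and the 8 GiB cache budget are hypothetical placeholders, not the real Qwen 3.6-27B configuration or a measured figure, and a model with a different attention design would need a different formula. The bytes-per-element values reflect the llama.cpp block layouts for f16, q8_0, and q4_0.

```python
# Bytes per cached element: f16 stores 2 bytes per value;
# q8_0 packs 32 values into 34 bytes, q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       cache_type: str) -> float:
    """KV cache bytes needed per token of context."""
    # Factor of 2: one K vector and one V vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_ELEM[cache_type]


def max_context(kv_budget_bytes: int, n_layers: int, n_kv_heads: int,
                head_dim: int, cache_type: str) -> int:
    """Largest whole number of tokens whose KV cache fits in the budget."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, cache_type)
    return int(kv_budget_bytes // per_token)


if __name__ == "__main__":
    # HYPOTHETICAL architecture and budget, for illustration only.
    arch = dict(n_layers=64, n_kv_heads=4, head_dim=128)
    budget = 8 * 1024**3  # assume 8 GiB of VRAM left for the cache
    for cache_type in BYTES_PER_ELEM:
        tokens = max_context(budget, cache_type=cache_type, **arch)
        print(f"{cache_type}: about {tokens} tokens")
```

Whatever the real dimensions are, the ratio holds: for a fixed memory budget, moving from f16 to q8_0 nearly doubles the context that fits, which is the headroom tensor parallel mode gives up in this setup.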

Why It Matters

Demonstrates that a budget dual-GPU setup can match a high-end card for local LLM inference when context requirements are modest, lowering the cost barrier for AI professionals.