Dual RTX 3060 setup runs Qwen 3.6-27B at 43 t/s for $400
Budget dual RTX 3060 outperforms AMD 7900 XTX in AI inference stability
A hobbyist AI builder has demonstrated a cost-effective setup for running large language models locally. Using two RTX 3060 GPUs (12GB each, about $400 total) on an aging i7-4770K platform, they ran Qwen 3.6-27B, a 27B-parameter model from Alibaba, in the Q4_K_S quantized GGUF format. The dual 3060s delivered 456 tokens per second (t/s) during prompt processing and 43.26 t/s during text generation. The builder rated this above their previous experience with an AMD 7900 XTX ($900), which suffered from unstable compute performance and less consistent prefill speeds (300-500 t/s).
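To see why a 27B-parameter model fits on two 12GB cards at all, a rough back-of-the-envelope estimate helps. The sketch below is illustrative only: the 4.5 bits-per-weight figure is an assumed average for a roughly 4-bit K-quant, not a number reported by the builder, and the real GGUF file size will differ somewhat.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9          # from the article: 27B-parameter model
BITS_PER_WEIGHT = 4.5    # ASSUMPTION: rough average for a ~4-bit K-quant
TOTAL_VRAM_GB = 2 * 12   # from the article: two 12GB cards

weights_gb = weight_footprint_gb(N_PARAMS, BITS_PER_WEIGHT)
headroom_gb = TOTAL_VRAM_GB - weights_gb  # left for KV cache and buffers

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```

Under that assumption the weights take roughly 15 GB, leaving under 9 GB across both cards for the KV cache, compute buffers, and any speculative-decoding overhead. That limited headroom is consistent with the context and draft-length limits the builder reports.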
The key enabler is llama.cpp compiled with CUDA support, splitting the model across both GPUs in tensor-parallel mode and using MTP (Multi-Token Prediction) speculative decoding. Each GPU ran at PCIe 3.0 x8 on an SLI-capable board, which offers the same raw bandwidth as the PCIe 4.0 x4 links available on modern boards. There is one notable limitation: tensor-parallel mode prevents the use of KV cache quantization, restricting the context window to roughly 64k tokens, far below the 160k the builder typically needs. In addition, MTP with draft_n_max=2 ran out of VRAM, so only single-token drafts were stable. At 12k context, performance remained consistent, suggesting that this budget configuration can rival or exceed much pricier platforms for local LLM inference when context requirements are modest.
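The PCIe equivalence mentioned above can be checked with simple arithmetic. This is a minimal sketch using the published per-lane signalling rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and their shared 128b/130b line encoding. It ignores packet and protocol overhead, so real-world throughput will be lower on both links.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int) -> float:
    """One-direction bandwidth in GB/s after 128b/130b encoding.

    Ignores packet and protocol overhead, so this is an upper bound.
    """
    encoding_efficiency = 128 / 130
    return gt_per_s * encoding_efficiency * lanes / 8  # 8 bits per byte


gen3_x8 = pcie_bandwidth_gb_s(8.0, lanes=8)    # PCIe 3.0 x8, as in the build
gen4_x4 = pcie_bandwidth_gb_s(16.0, lanes=4)   # PCIe 4.0 x4, for comparison

print(f"PCIe 3.0 x8: {gen3_x8:.2f} GB/s")
print(f"PCIe 4.0 x4: {gen4_x4:.2f} GB/s")
```

Both come out just under 8 GB/s per direction, so an older x8/x8 platform gives each card about the same link bandwidth as a modern board that runs a second slot at PCIe 4.0 x4.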
- Dual RTX 3060 (24GB total, ~$400) delivers 43.26 t/s text generation on Qwen 3.6-27B Q4_K_S
- Prompt processing hit 456 t/s, outperforming a $900 AMD 7900 XTX in stability and speed
- Tensor parallel blocks KV cache quantization, capping context at ~64k; MTP limited to single-token drafts to avoid OOM
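To illustrate why losing KV cache quantization matters for context length, the sketch below scales the reported ~64k ceiling by the storage cost of different cache types. It relies on two assumptions not stated in the source: that cache memory is the binding constraint and that it grows linearly with token count. The byte sizes follow the ggml block layouts for `q8_0` and `q4_0`, which store 32 values plus one fp16 scale, and both the K and V caches are assumed to use the same type.

```python
# Bytes per cached element for each cache type.
# q8_0: 32 int8 values + 2-byte scale = 34 bytes per 32 elements.
# q4_0: 32 4-bit values + 2-byte scale = 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def scaled_context(f16_context: int, cache_type: str) -> int:
    """Tokens that fit in the memory an f16 cache needs for f16_context tokens."""
    ratio = BYTES_PER_ELEMENT["f16"] / BYTES_PER_ELEMENT[cache_type]
    return int(f16_context * ratio)


F16_CONTEXT = 64_000   # from the article: "roughly 64k" without quantization
TARGET = 160_000       # from the article: context the builder typically needs

for cache_type in BYTES_PER_ELEMENT:
    tokens = scaled_context(F16_CONTEXT, cache_type)
    verdict = "meets" if tokens >= TARGET else "falls short of"
    print(f"{cache_type}: ~{tokens:,} tokens, {verdict} the 160k target")
```

Under these assumptions, an 8-bit cache would stretch the same memory to about 120k tokens and a 4-bit cache to about 228k, so only the more aggressive setting would clear the 160k target. Actual limits depend on the model's attention layout and on llama.cpp's buffer allocation, neither of which is modelled here.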
Why It Matters
The build shows that budget dual-GPU setups can match high-end cards for local LLM inference when context requirements are modest, lowering the barrier to entry for AI professionals.