Open Source

Dual RTX 3090 setup hits 113 tk/s with Qwen 3.6 27B locally

Ubuntu dual boot unlocks 4000 tokens/s prompt processing on 48GB VRAM, which the user says beats cloud inference speed.

Deep Dive

A Reddit user has demonstrated that dual RTX 3090s (48GB VRAM combined) can run Qwen 3.6 27B with a 262K context at impressive speeds using the open-source 'club-3090' project. Under WSL2 the user saw only 30 tk/s generation and 400 tokens/s prompt processing (pp/s), but switching to a native Ubuntu dual boot raised those figures to 113 tk/s and 4000 pp/s, all without NVLink. The configuration needed patches written by Claude Sonnet to fix SSE session drops and tool-calling bugs, but the result is local inference that the user describes as 'almost-sonnet level' for code review and monkey patching.
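To put the reported numbers in perspective, here is a minimal Python sketch that computes the WSL2-to-Ubuntu speedups and a rough time to ingest a full context window. The throughput figures come from the post; the helper names are illustrative, and two things are assumptions: that "262K" means 262,144 tokens, and that prompt throughput stays constant across the whole context (in practice it usually falls as the context grows, so treat the timings as optimistic).

```python
# Throughput figures reported in the post (tokens per second).
WSL2 = {"prompt_tps": 400.0, "gen_tps": 30.0}
UBUNTU = {"prompt_tps": 4000.0, "gen_tps": 113.0}

# Assumption: "262K context" is read as 2**18 = 262,144 tokens.
CONTEXT_TOKENS = 262_144


def speedup(before: float, after: float) -> float:
    """Ratio of the new throughput to the old one."""
    return after / before


def seconds_to_process(tokens: int, tokens_per_second: float) -> float:
    """Idealized time to process `tokens` at a constant throughput."""
    return tokens / tokens_per_second


prompt_speedup = speedup(WSL2["prompt_tps"], UBUNTU["prompt_tps"])
gen_speedup = speedup(WSL2["gen_tps"], UBUNTU["gen_tps"])

print(f"Prompt processing speedup: {prompt_speedup:.1f}x")
print(f"Generation speedup: {gen_speedup:.2f}x")

for name, cfg in (("WSL2", WSL2), ("Ubuntu", UBUNTU)):
    secs = seconds_to_process(CONTEXT_TOKENS, cfg["prompt_tps"])
    print(f"{name}: full-context prompt in about {secs:.0f} s")
```

Under these assumptions, filling the whole context takes roughly 66 seconds on native Ubuntu versus about 11 minutes under WSL2, which is why the prompt-processing gain matters more than the generation gain for long-context coding work.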

The user believes this signals a viable path for budget local AI setups (≈$3K for two used 3090s). They claim that current small models in the 27B class already outperform cloud APIs in speed while rivaling frontier models on domain-specific tasks like SSH session management and code review. Speculation about next upgrades (M5 Ultra, DGX Sparks) suggests the community anticipates frontier-class small models within the next 12 months, further democratizing high-performance local AI.

Key Points
  • Dual RTX 3090 (48GB VRAM) runs Qwen 3.6 27B at 113 tk/s generation and 4000 tokens/s prompt processing on Ubuntu
  • Switching from WSL2 to native Linux increased prompt processing by 10x (400 to 4000 tokens/s) and generation by about 3.8x (30 to 113 tk/s)
  • User calls local inference 'almost-sonnet level' for code review and agentic tasks like SSH management

Why It Matters

Shows that a pair of used RTX 3090s can rival cloud AI speeds locally, enabling private, cost-effective coding assistants.