Open Source

Qwen3.6 Q6 quantization makes local coding agents viable on dual 3090s

Moving from Q4 to Q6 delivers production-grade quality at 20–50 tokens/sec

Deep Dive

A Reddit user reports switching from Ollama to llama.cpp's built-in server, noting outstanding quality improvement from Q4 to Q6 quantization. On a dual RTX 3090 setup (undervolted, limited to 65°C), the model generates 20–50 tokens per second with minimal heat. Multi-token prediction (MTP) provides a big performance gain, making local coding agents finally work.

Key Points
  • Transition from Ollama to llama.cpp server improved reliability and performance
  • Q6 quantization of Qwen3.6 delivers near-API quality for coding, versus Q4's lower fidelity
  • Dual RTX 3090 (undervolted, 65°C) achieves 20–50 t/s with MTP enabled, minimal heat

Why It Matters

Local coding agents become practical—cutting API costs while keeping sensitive code private on consumer GPUs.