RTX 5080 setup runs local autocomplete and agentic coding with Qwen models
A single 16GB GPU, paired with 64GB of system RAM, handles both instant infill and reliable agentic tasks.
A developer has shared a practical local setup combining two Qwen models for AI-assisted coding on a single RTX 5080 (16GB VRAM) with system RAM offloading. For autocomplete and infill, the setup uses bartowski/Qwen2.5-Coder-7B-Instruct at Q6_K_L quantization, consuming about 8GB VRAM and producing essentially instant suggestions. The developer reports that Qwen2.5-Coder still leads for infill tasks, outperforming alternatives like Gemma4 and Qwen3.5 variants, which produced 'weird suggestions.' The model is served by llama-server on port 8081 with fixed sampling settings (temp 0.5, top-p 0.95, top-k 20, min-p 0.0).
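As an illustration of how an editor plugin might talk to that autocomplete server, here is a minimal Python sketch. It assumes llama-server's `/infill` endpoint is reachable at `http://127.0.0.1:8081` and accepts the `input_prefix`, `input_suffix`, and `n_predict` fields alongside the sampling settings quoted above. The host, the `n_predict` value, and the helper names are illustrative assumptions, not details from the developer's setup.

```python
import json
import urllib.request

# Sampling settings reported for the autocomplete model.
SAMPLING = {"temperature": 0.5, "top_p": 0.95, "top_k": 20, "min_p": 0.0}


def build_infill_request(prefix, suffix, n_predict=64,
                         base_url="http://127.0.0.1:8081"):
    """Build (but do not send) a POST request for llama-server's /infill.

    base_url and n_predict are assumptions chosen for illustration.
    """
    payload = {
        "input_prefix": prefix,   # code before the cursor
        "input_suffix": suffix,   # code after the cursor
        "n_predict": n_predict,   # cap on generated tokens
        **SAMPLING,
    }
    return urllib.request.Request(
        f"{base_url}/infill",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def infill(prefix, suffix, timeout=10.0):
    """Send the request and return the generated middle section."""
    request = build_infill_request(prefix, suffix)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["content"]
```

With the server running, `infill("def add(a, b):\n    return ", "\n")` would return the model's suggestion for the gap between the two strings.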
For agentic coding, the developer chose unsloth/Qwen3.6-35B-A3B-GGUF at UD-Q8_K_XL quantization. This mixture-of-experts model activates only 3B parameters per token, so it stays fast even though only part of it sits in the remaining 8GB of VRAM and the rest is offloaded to system RAM. At Q8, it 'can figure stuff out and actually finish its work correctly,' whereas the developer found lower quants (Q4) unusable. The setup reaches ~35 tokens per second of generation and approximately 145k tokens of context with llama.cpp's autofit. With both models running, total RAM usage sits around 56GB (including browser, IDE, Teams), so the developer recommends 64GB minimum. This configuration enables fully offline, high-quality coding assistance without cloud dependencies.
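A rough back-of-the-envelope check shows why system RAM matters here. The sketch below assumes the model has exactly 35 billion weights stored at about 8.5 bits each (the rate of plain Q8_0; the real UD-Q8_K_XL file may differ), and it ignores the KV cache and compute buffers. All function names and defaults are illustrative, not measurements from the developer's machine.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def offload_budget(total_vram_gb=16.0, autocomplete_vram_gb=8.0,
                   params_billion=35.0, bits_per_weight=8.5):
    """Estimate how much of the agentic model must live in system RAM.

    bits_per_weight=8.5 is an assumption borrowed from Q8_0; the KV cache
    and runtime buffers are deliberately left out of this estimate.
    """
    free_vram = total_vram_gb - autocomplete_vram_gb
    weights = weight_size_gb(params_billion, bits_per_weight)
    offloaded = max(0.0, weights - free_vram)
    return {"free_vram_gb": free_vram,
            "weights_gb": weights,
            "offloaded_gb": offloaded}


estimate = offload_budget()
print(estimate)
```

Under these assumptions the weights come to roughly 37GB, of which about 29GB would have to sit in system RAM. That is consistent with the reported ~56GB total once the browser, IDE, and other applications are counted.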
- Qwen2.5-Coder-7B at Q6_K_L uses ~8GB VRAM for instant autocomplete/infill and, per the developer, beats the other local models tried for infill.
- Qwen3.6-35B-A3B at Q8 activates only 3B parameters per token, runs in the remaining 8GB VRAM with offloading to system RAM, and delivers reliable agentic coding at ~35 tok/s.
- The developer recommends at least 64GB of system RAM for offloading; the setup reaches ~145k context with llama.cpp autofit.
Why It Matters
Enables professional-grade AI coding assistance entirely offline on a single consumer GPU, removing cloud latency and cost.