Dual RTX 4060 Ti delivers 125 tok/s on Qwen 3.6 for under $1K

r/LocalLLaMA May 30, 2026

⚡ A sub-$1,000 dual RTX 4060 Ti setup is claimed to outpace $5K 2026 mini PCs at LLM inference.

Deep Dive

Reddit user Chuyito claims that a setup costing under $1,000, drawing about 300W and offering 32GB of VRAM, outperforms $5K mini PCs from 2026. The rig runs Qwen 3.6 models at Q4_K_XL quantization using llama.cpp with CUDA 13.3 and a tensor split across two GPUs. The configuration uses speculative decoding via MTP and is launched with a specific podman command. Chuyito also asks whether 150 tokens per second is reachable this weekend.

Key Points

125 tokens per second on a 35B Qwen 3.6 model using dual RTX 4060 Ti (32GB total) for under $1,000.
Uses llama.cpp with CUDA 13.3, tensor split (0.97,0.97), 125K context, and speculative MTP decoding.
Reportedly outperforms $5,000 2026-era mini PCs on inference performance per dollar.
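
The tensor split of (0.97,0.97) is easier to read once the values are normalized. The sketch below assumes the values are relative proportions per GPU, which is how llama.cpp documents its `--tensor-split` option; `normalize_split` is a hypothetical helper written for illustration, not part of llama.cpp.

```python
def normalize_split(proportions):
    """Convert relative per-GPU proportions into fractions that sum to 1.

    Assumption: the split values are relative weights, so only their
    ratio matters, not their absolute size.
    """
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must sum to a positive number")
    return [p / total for p in proportions]


# Values reported in the post: the same weight for each of the two GPUs.
reported_split = normalize_split([0.97, 0.97])

# Under the proportional reading, this matches a plain 1:1 split.
even_split = normalize_split([1, 1])

print(reported_split)
print(reported_split == even_split)
```

Under that assumption, equal weights place half of the model on each card, so (0.97,0.97) would behave the same as (1,1).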

Why It Matters

If the claim holds, a sub-$1K pair of consumer GPUs can rival premium AI workstations for local LLM inference, putting high-speed inference within reach of hobbyist budgets.
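
The performance-per-dollar claim can be checked with simple arithmetic. This is a rough sketch using only the figures reported in the post; the variable names are illustrative, and $1,000 is taken as an upper bound on the rig's price, so the efficiency figure is a lower bound. The post gives no throughput for the mini PCs, so only a break-even value is computed.

```python
# Figures reported in the post.
rig_tok_s = 125          # tokens per second
rig_price_usd = 1000     # "under $1,000", used here as an upper bound
rig_power_w = 300        # approximate power draw
mini_pc_price_usd = 5000
target_tok_s = 150       # the weekend goal mentioned in the post

# Throughput per dollar for the dual-GPU rig (a lower bound).
rig_tok_s_per_usd = rig_tok_s / rig_price_usd

# Throughput a $5,000 machine would need to match that ratio.
breakeven_tok_s = rig_tok_s_per_usd * mini_pc_price_usd

# Throughput per watt at the reported power draw.
rig_tok_s_per_watt = rig_tok_s / rig_power_w

# Relative gain needed to go from 125 to 150 tokens per second.
required_gain = (target_tok_s - rig_tok_s) / rig_tok_s

print(f"tok/s per dollar: {rig_tok_s_per_usd:.3f}")
print(f"break-even tok/s for a $5,000 machine: {breakeven_tok_s:.0f}")
print(f"tok/s per watt: {rig_tok_s_per_watt:.3f}")
print(f"gain needed for 150 tok/s: {required_gain:.0%}")
```

On these numbers, a $5,000 machine would have to sustain 625 tokens per second on the same workload to match the rig per dollar, and reaching 150 tokens per second requires a 20% gain over the reported 125.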
