Qwen3.6 27B on llama.cpp delivers blazing agentic inference speeds
Two RX 9070 XTs push 46+ tokens/s for a dense 27B model
A developer has posted an appreciation thread for the Qwen3.6 27B model running on llama.cpp across two AMD RX 9070 XT GPUs connected via PCIe 5.0 x8/x8. With each card power-limited to ~235 W, they configured llama-server with speculative decoding using a draft-MTP approach and a 131K context window. The result: prompt evaluation at 2.2–7 ms per token (up to 446 tokens/s) and generation averaging 19–22 ms per token (~46 tokens/s), with a draft acceptance rate above 80%.
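The per-token latencies and the throughput figures above are two views of the same measurement. A minimal Python sketch of the conversion, using only the ranges quoted above (the function name `tokens_per_second` is illustrative and not part of llama.cpp):

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


# Latency ranges quoted in the summary, as (fastest, slowest) in ms per token.
prompt_ms = (2.2, 7.0)
generation_ms = (19.0, 22.0)

for label, (fast, slow) in (("prompt", prompt_ms), ("generation", generation_ms)):
    low = tokens_per_second(slow)
    high = tokens_per_second(fast)
    print(f"{label}: {low:.1f} to {high:.1f} tokens/s")
```

By this arithmetic, 19 to 22 ms per token corresponds to roughly 45 to 53 tokens/s, so the ~46 tokens/s figure sits near the slower end of the range. Likewise, 2.2 ms per token works out to about 455 tokens/s, so the quoted 446 tokens/s peak presumably reflects an unrounded latency slightly above 2.2 ms.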
Despite using a 5-bit quantized GGUF (UD-Q5_K_XL), which the user admits is a bit low for their liking, the model demonstrated impressive agentic capabilities. During a real debugging session involving three backend services on separate instances with different configs, Qwen3.6 autonomously added logging, spun up local services, executed requests against local and remote instances, iterated on its findings, and mocked non-critical parts to preserve reproducibility. Vague issues were narrowed down to specific lines of code, and the model stayed highly responsive for a dense 27B-parameter model running on consumer-grade hardware.
- Qwen3.6 27B runs at 46+ tokens/s on dual RX 9070 XTs using llama.cpp with speculative decoding (draft acceptance rate ~82%).
- Prompt processing reaches up to 446 tokens/s with a 131K context window and flash attention enabled.
- The model autonomously debugged a multi-service backend issue, isolating bugs to specific lines via logging, local/remote requests, and mocking.
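To see why a draft acceptance rate above 80% matters for generation speed, here is a rough Python sketch of the standard speculative-decoding estimate. If each drafted token is accepted independently with probability `alpha` and `k` tokens are drafted per step, the target model emits (1 - alpha^(k+1)) / (1 - alpha) tokens per verification pass on average. The independence assumption, the function name, and the draft lengths tried below are all illustrative: the thread summary does not state the draft length used, and the acceptance rate that llama.cpp reports is an aggregate that need not match this idealized `alpha`.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Idealized model: each of the k drafted tokens is accepted independently
    with probability alpha, and the target model always contributes one
    token of its own (the correction or the bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the target model's own token.
        return float(k + 1)
    # Closed form of the geometric series 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Hypothetical draft lengths; 0.82 is the acceptance rate quoted above.
for k in (1, 2, 4):
    print(f"k={k}: {expected_tokens_per_pass(0.82, k):.2f} tokens per pass")
```

This estimate ignores the cost of producing the draft tokens, so it is an upper bound on the speed-up rather than a prediction of the measured ~46 tokens/s.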
Why It Matters
The thread shows that a dense 27B open-weight model can run fast enough for agentic workflows on consumer GPU hardware.