Open Source

Qwen3.6 27B on llama.cpp delivers blazing agentic inference speeds

r/LocalLLaMA May 21, 2026

⚡Two RX 9070 XTs push 46+ tokens/s for a dense 27B model

Deep Dive

A developer has posted an appreciation thread for the Qwen3.6 27B model running on llama.cpp across two AMD RX 9070 XT GPUs connected via PCIe 5.0 x8/x8. With each card power-limited to ~235W, they configured llama-server for speculative decoding using a draft-MTP approach and a 131K context window. The result: prompt evaluation at 2.2–7 ms per token (up to 446 tokens/s) and generation averaging 19–22 ms per token (~46 tokens/s), with a draft acceptance rate above 80%.
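The per-token latencies and tokens-per-second figures above are two views of the same measurement, and converting between them is simple arithmetic. The snippet below is a minimal sketch of that conversion; the function names are illustrative and are not taken from the original thread or from llama.cpp.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


def ms_per_token(tokens_per_sec: float) -> float:
    """Convert throughput in tokens/s back to a per-token latency in milliseconds."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return 1000.0 / tokens_per_sec


if __name__ == "__main__":
    # Latency figures quoted in the article (ms per token).
    for label, ms in [("prompt, fastest", 2.2), ("prompt, slowest", 7.0),
                      ("generation, fastest", 19.0), ("generation, slowest", 22.0)]:
        print(f"{label}: {ms} ms/token = {tokens_per_second(ms):.1f} tokens/s")
```

By this arithmetic, 2.2 ms per token is roughly 455 tokens/s and 7 ms is roughly 143 tokens/s, while the 19–22 ms generation range spans roughly 45 to 53 tokens/s. The quoted ~46 tokens/s therefore sits near the slower end of the reported generation range.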

Although the setup used a 5-bit quantized GGUF (UD-Q5_K_XL), which the user admits is a bit low for their liking, the model demonstrated impressive agentic capabilities. During a real debugging session involving three backend services on separate instances with different configs, Qwen3.6 autonomously added logging, spun up local services, executed requests against local and remote instances, iterated on findings, and mocked non-critical parts to preserve reproducibility. The result: vague issues were narrowed down to specific lines of code, and the model stayed highly responsive for a dense 27B-parameter model running on consumer-grade hardware.

Key Points

Qwen3.6 27B runs at 46+ tokens/s on dual RX 9070 XTs using llama.cpp with speculative decoding (draft acceptance rate ~82%).
Prompt processing exceeds 430 tokens/s (peaking at about 446 tokens/s) with a 131K context window and flash attention enabled.
The model autonomously debugged a multi-service backend issue, isolating bugs to specific lines via logging, local/remote requests, and mocking.
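To give a rough sense of why a high draft acceptance rate matters, the sketch below uses the standard simplified model of speculative decoding, in which each drafted token is accepted independently with the same probability. Under that assumption, the expected number of tokens produced per pass of the large model is (1 - a^(k+1)) / (1 - a) for acceptance probability a and draft length k. The draft length of 3 is hypothetical, since the article does not state it, and the acceptance rate llama.cpp reports is an aggregate statistic rather than a per-token probability, so treat the result as an illustration only.

```python
def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass in a simplified model
    where each drafted token is accepted independently with probability
    `acceptance` and the target model always contributes one token itself."""
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        # Every drafted token is accepted, plus the target model's own token.
        return float(draft_len + 1)
    # Closed form of the geometric sum 1 + a + a^2 + ... + a^draft_len.
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


if __name__ == "__main__":
    # 0.82 is the acceptance rate quoted in the article;
    # draft_len=3 is an assumed value chosen for illustration.
    print(expected_tokens_per_pass(0.82, 3))
```

In this simplified model, an 82% acceptance rate with three drafted tokens yields about three tokens per large-model pass instead of one, before accounting for the cost of drafting.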

Why It Matters

Shows that a dense 27B open-weight model can run fast enough for agentic workflows on consumer GPU hardware.
