Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6
A massive 32-GPU AMD setup pushes Kimi K2.6 to 264 tok/s prompt prefill and 9.7 tok/s generation, but guzzles up to 4.8 kW at peak.
A Reddit user known as ai-infos has showcased a homebrew AI inference monster: 32 AMD MI50 GPUs (32GB each) split across two nodes, running the Kimi K2.6 model in int4 quantization. The setup, connected via 10G Ethernet and a custom vLLM fork (vllm-gfx906-mobydick), delivers 9.7 tokens per second during generation and a blazing 264 tok/s during prefill. Power consumption is staggering — 640W idle and up to 4,800W under peak load.
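The summary doesn't include the launch configuration, but a two-node vLLM deployment of this shape typically combines tensor parallelism within each node with pipeline parallelism across nodes, executed over a Ray cluster. The sketch below is a minimal illustration against mainline vLLM's Python API; the model path, parallel sizes, and context length are assumptions, and the options actually exposed by the vllm-gfx906-mobydick fork may differ.

```python
# Minimal sketch of multi-node vLLM offline inference with tensor + pipeline
# parallelism. Model path, parallel sizes, and context length are illustrative
# assumptions, not the builder's published configuration.
# Prerequisite (not shown): a Ray cluster spanning both nodes, e.g.
#   node 0: ray start --head --port=6379
#   node 1: ray start --address=<node0-ip>:6379
from vllm import LLM, SamplingParams

MODEL_PATH = "/models/kimi-k2.6-int4"    # hypothetical local path to the int4 checkpoint

llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=16,             # assumed: shard each layer across 16 GPUs per node
    pipeline_parallel_size=2,            # assumed: one pipeline stage per node (2 x 16 = 32 GPUs)
    distributed_executor_backend="ray",  # run the workers across both nodes via Ray
    trust_remote_code=True,
    max_model_len=8192,                  # assumed cap to keep KV cache within 32 GB per GPU
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PCIe risers in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Assuming an even 16 GPUs per node, the 16x2 split keeps the bandwidth-hungry tensor-parallel all-reduces inside each node and sends only per-stage activations across the 10G Ethernet link between pipeline stages.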
Despite the impressive raw throughput, the author notes severe PCIe bandwidth bottlenecks (some links run at only 3.5 GB/s due to riser instability) and admits the system is not cost-effective. Theoretically, with full PCIe Gen4 x16 and optimized software, performance could hit 600–1000 tok/s prefill and 9–12 tok/s generation. But the builder is calling it quits, concluding that smaller models on leaner hardware offer a better experience. The project remains a fascinating proof-of-concept for extreme local LLM inference on AMD GPUs.
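For a sense of scale: PCIe Gen4 x16 tops out around 31.5 GB/s in theory, so a riser link sustaining only 3.5 GB/s gives up roughly a factor of nine in bandwidth. A back-of-the-envelope sketch, with an assumed per-step payload, shows why communication-heavy phases such as long prefills have the most headroom to regain:

```python
# Back-of-the-envelope: how much a degraded PCIe link throttles cross-GPU traffic.
# The payload size is an illustrative assumption, not a measurement from the build.
GEN4_X16_GBPS = 31.5      # theoretical PCIe Gen4 x16 bandwidth, GB/s
DEGRADED_GBPS = 3.5       # bandwidth the author reports on unstable riser links, GB/s

payload_mb = 512          # assumed cross-GPU traffic per prefill step, MB

for name, bw in [("Gen4 x16", GEN4_X16_GBPS), ("degraded riser", DEGRADED_GBPS)]:
    ms = payload_mb / 1024 / bw * 1000
    print(f"{name:>14}: {ms:6.2f} ms to move {payload_mb} MB")

# Communication-bound phases stall roughly in proportion to the link slowdown.
print(f"slowdown factor: {GEN4_X16_GBPS / DEGRADED_GBPS:.1f}x")
```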
- 32 AMD MI50 32GB GPUs across two nodes deliver 9.7 tok/s generation and 264 tok/s prefill on Kimi K2.6 (int4).
- Peak power draw is 4,800 W, with 640 W at idle; economical only with essentially free electricity (see the cost sketch after this list).
- Custom vLLM fork (vllm-gfx906-mobydick) and 10G Ethernet interconnect enable multi-node inference.
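To put the power figures in perspective, here is a quick estimate of energy per generated token at the peak numbers quoted above; the electricity price is an assumed placeholder:

```python
# Rough energy cost per generated token at the reported peak figures.
# The electricity price is an assumed placeholder; substitute your local rate.
PEAK_WATTS = 4800
GEN_TOKENS_PER_S = 9.7
PRICE_PER_KWH = 0.30      # assumed EUR/kWh; "free energy" pushes this toward 0

joules_per_token = PEAK_WATTS / GEN_TOKENS_PER_S            # ~495 J per token
kwh_per_million = joules_per_token * 1_000_000 / 3.6e6      # J -> kWh
cost_per_million = kwh_per_million * PRICE_PER_KWH

print(f"{joules_per_token:.0f} J per generated token")
print(f"{kwh_per_million:.0f} kWh per million tokens "
      f"(~{cost_per_million:.0f} EUR at {PRICE_PER_KWH} EUR/kWh)")
```

At roughly 495 J per generated token, the rig burns about 137 kWh per million output tokens, so anything but near-free electricity quickly dominates the running cost.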
Why It Matters
Shows the outer limits of local LLM inference with AMD GPUs, but at prohibitive energy cost.