Open Source

Reddit user finds 220W sweet spot for 4x RTX 3090 AI inference

r/LocalLLaMA May 16, 2026

⚡ Peak efficiency at 220W per GPU with Qwen3.6-27B, with minimal performance loss

Deep Dive

A Reddit user (anitamaxwynnn69) conducted a rigorous power-efficiency test on a 4x RTX 3090 setup running AI inference with vLLM v0.20.2 and the Qwen3.6-27B model at FP16 precision, using tensor parallelism (TP=4). The GPUs were a Dell OEM card, an EVGA XC3, and two ASUS Strix cards, connected via PCIe Gen3 (bifurcated x16/x8/x8/x4). The user tested per-GPU power limits from 200W up to unrestricted (350-390W) and measured output speed in tokens per second (t/s), prompt processing speed, total throughput, and efficiency in tokens per joule.
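The summary does not say how the power limits were applied. On NVIDIA cards a common way is the `nvidia-smi -i <index> -pl <watts>` command, which usually needs root. The sketch below only assumes that approach; the helper names `power_limit_commands` and `apply_power_limit` are hypothetical, and by default nothing is executed.

```python
import subprocess


def power_limit_commands(gpu_indices, watts):
    """Build one `nvidia-smi -i <idx> -pl <watts>` argument list per GPU."""
    return [["nvidia-smi", "-i", str(i), "-pl", str(watts)] for i in gpu_indices]


def apply_power_limit(gpu_indices, watts, dry_run=True):
    """Print each command; run it only when dry_run is False (assumes root access)."""
    for cmd in power_limit_commands(gpu_indices, watts):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if nvidia-smi rejects the limit
            subprocess.run(cmd, check=True)
```

Calling `apply_power_limit(range(4), 220)` prints the four commands for a 4-GPU rig without touching the hardware; passing `dry_run=False` would actually run them.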

Key results: At the unrestricted 350-390W, total throughput was 269 t/s with an efficiency of 0.77 t/joule. Dropping to 220W gave 248 t/s and a peak efficiency of 1.13 t/joule, a ~46% efficiency gain for only an 8% throughput loss. Above 250W, returns diminished sharply. The 220W sweet spot matches earlier blog findings referenced in the post. The setup uses an open mining frame with 10x TL-C12C-S fans. The user is now exploring larger models such as DeepSeek v4 at Q2, while noting that Qwen3.6-27B remains a solid daily driver for multi-GPU hobbyists.
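The percentages can be rechecked from the rounded figures quoted in this article. The short sketch below does that arithmetic; `percent_change` is an illustrative helper, and because the inputs are already rounded, the results may differ slightly from values computed on the poster's raw data.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent (positive means an increase)."""
    return (new - old) / old * 100.0


# Figures as quoted in the article (unrestricted vs. 220W per GPU)
efficiency_gain = percent_change(0.77, 1.13)    # tokens per joule
throughput_loss = -percent_change(269, 248)     # total throughput, t/s
output_loss = -percent_change(29, 27)           # output speed, t/s

print(f"efficiency gain: {efficiency_gain:.1f}%")
print(f"total throughput loss: {throughput_loss:.1f}%")
print(f"output speed loss: {output_loss:.1f}%")
```

The two loss figures differ because they measure different things: one is total throughput and the other is output speed alone.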

Key Points

Peak efficiency of 1.13 tokens/joule at 220W per GPU, vs 0.77 at unrestricted power
Output speed drops only from 29 to 27 t/s when lowering power from 390W to 220W (less than 7% loss)
Diminishing returns clearly observed: raising power above 250W provides negligible throughput gains

Why It Matters

Power optimization cuts electricity costs and heat for multi-GPU AI rigs, making high-end inference more accessible.
