Open Source

Multi-GPU inference: CUDA beats Vulkan with 10x the speed and lower memory overhead

A budget build with 2x RTX 3090 and an Arc A770 shows what mixing GPU vendors costs in speed and context.

Deep Dive

A Reddit user built a multi-GPU inference server with 2x RTX 3090 (24GB each) and an Intel Arc A770 (16GB). All parts were purchased second-hand except the RAM sticks and the case. Testing with llama.cpp, they found that the CUDA backend runs Qwen 3.6 27b at 30 tokens/s with 170k context, while the Vulkan backend's memory overhead (5GB extra per card) cut context to 50k and speed to 3 tokens/s. Lesson: stick to a single vendor and use that vendor's own backend.

Key Points
  • 2x RTX 3090 + Arc A770: CUDA runs Qwen 3.6 27b at 30 tokens/s with 170k context; Vulkan drops to 3 tokens/s with 50k context
  • Vulkan memory overhead is 5GB extra per 24GB card, leaving little room for KV cache
  • Lesson: For multi-GPU inference, stick to a single vendor's native backend (e.g., CUDA for Nvidia) to avoid severe performance penalties
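
The memory point above can be checked with back-of-the-envelope arithmetic. The sketch below subtracts the reported 5GB Vulkan overhead from each 24GB card and also computes the speed and context ratios from the quoted figures. The function name `usable_vram_gb` is illustrative, the A770 is left out because the overhead figure is quoted for 24GB cards, and the size of the model weights is not modelled because the source does not give it.

```python
# Rough VRAM budget using only the figures reported in the post.
RTX_3090_CARDS_GB = [24, 24]   # two RTX 3090 cards, 24GB each
VULKAN_OVERHEAD_GB = 5         # reported extra memory per 24GB card under Vulkan


def usable_vram_gb(cards_gb, overhead_per_card_gb):
    """Total VRAM left after a fixed per-card overhead (hypothetical helper)."""
    return sum(max(size - overhead_per_card_gb, 0) for size in cards_gb)


total_gb = sum(RTX_3090_CARDS_GB)
left_gb = usable_vram_gb(RTX_3090_CARDS_GB, VULKAN_OVERHEAD_GB)

speed_ratio = 30 / 3       # CUDA tokens/s divided by Vulkan tokens/s
context_ratio = 170 / 50   # CUDA context (k tokens) divided by Vulkan context

print(f"VRAM on the two 3090s: {total_gb} GB")
print(f"Left after Vulkan overhead: {left_gb} GB")
print(f"Speed ratio: {speed_ratio:.0f}x, context ratio: {context_ratio:.1f}x")
```

On these numbers, the overhead alone removes 10GB of the 48GB on the two 3090s before any weights or KV cache are loaded.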

Why It Matters

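The build shows that mixing GPU brands for inference is inefficient: in this test, the cross-vendor Vulkan backend ran at a tenth of the CUDA speed with less than a third of the context. Professionals should standardize on one vendor for cost-effective performance.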
