2x RTX 3090 + Arc A770?

CUDA runs Qwen 27b at 30 tokens/s with 170k context; Vulkan drops to 3 tokens/s with 50k context

Vulkan memory overhead is 5GB extra per 24GB card, leaving little room for KV cache?

Vulkan memory overhead is 5GB extra per 24GB card, leaving little room for KV cache

For multi-GPU inference, stick to a single vendor's native backend (e.g., CUDA for Nvidia) to avoid severe performance penalties

Open Source

Multi-GPU inference: CUDA beats Vulkan with 10x speed and lower memory overhead

r/LocalLLaMA June 27, 2026

⚡A budget build with 2x RTX 3090 and an Arc A770 reveals harsh truths.

Deep Dive

A Reddit user built a multi-GPU inference server with 2x RTX 3090 (24GB each) and an Intel Arc A770 (16GB). All parts were purchased second-hand except the RAM sticks and the case. Testing llama.cpp, they found CUDA runs Qwen 3.6 27b at 30 tokens/s with 170k context, while Vulkan's memory overhead (5GB extra per card) slashed context to 50k and speed to 3 tokens/s. Lesson: stick to a single vendor and use their own backend.

Key Points

2x RTX 3090 + Arc A770: CUDA runs Qwen 27b at 30 tokens/s with 170k context; Vulkan drops to 3 tokens/s with 50k context
Vulkan memory overhead is 5GB extra per 24GB card, leaving little room for KV cache
Lesson: For multi-GPU inference, stick to a single vendor's native backend (e.g., CUDA for Nvidia) to avoid severe performance penalties

Why It Matters

Highlights that mixing GPU brands for inference is inefficient; professionals should standardize on one vendor for cost-effective performance.

Read Original Article

Multi-GPU inference: CUDA beats Vulkan with 10x speed and lower memory overhead

Why It Matters

Related Articles

🚀 Stay Ahead in AI