Open Source

Fix: Dual Intel Arc GPUs using all system RAM during inference - found the cause and a working fix (llama.cpp SYCL)

Dual Arc GPUs were mirroring VRAM allocations 1:1 into system RAM, consuming roughly 500x more host memory than necessary.

Deep Dive

Users running dual Intel Arc GPUs with llama.cpp's SYCL backend encountered a critical memory bug: system RAM would max out during inference even when the model fit comfortably in VRAM. The issue affected configurations like dual Arc Pro B70s (64GB total VRAM), where loading a 15GB model could consume 46GB of system RAM, crashing the system or invoking the Linux OOM killer. The root cause wasn't model size or a configuration error, but a specific API interaction between llama.cpp and Intel's driver.
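
The behavior is observable without llama.cpp at all. Below is a minimal sketch of one way to reproduce it, assuming Linux, the oneAPI DPC++ compiler (icpx -fsycl), and an affected xe driver path; the 4GB figure mirrors the test described further down, and the measurement approach (watching MemAvailable) is an assumption, not the write-up's exact method.

    // Sketch: allocate 4 GB of GPU memory via sycl::malloc_device() and watch
    // how much host memory disappears. On an affected DMA-buf/TTM path the
    // drop approaches the full allocation size; otherwise it stays a few MB.
    // Build with: icpx -fsycl repro.cpp
    #include <sycl/sycl.hpp>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Read MemAvailable (in kB) from /proc/meminfo (Linux only).
    static long mem_available_kb() {
        std::ifstream meminfo("/proc/meminfo");
        std::string key, unit;
        long value = 0;
        while (meminfo >> key >> value >> unit)
            if (key == "MemAvailable:") return value;
        return -1;
    }

    int main() {
        sycl::queue q{sycl::gpu_selector_v};
        std::cout << "Device: "
                  << q.get_device().get_info<sycl::info::device::name>() << "\n";

        const size_t size = 4ull << 30;  // 4 GB, matching the reported test

        long before = mem_available_kb();
        void *ptr = sycl::malloc_device(size, q);
        q.memset(ptr, 0, size).wait();   // touch it so the driver backs it
        long after = mem_available_kb();

        std::cout << "Host MemAvailable dropped by "
                  << (before - after) / 1024 << " MB for a 4 GB VRAM allocation\n";
        sycl::free(ptr, q);
    }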

The problem stems from sycl::malloc_device() calls triggering Intel's xe kernel driver to take DMA-buf/TTM memory paths, which create 1:1 mirrors of GPU allocations in system RAM. Every tensor, KV-cache buffer, and compute scratch buffer allocated on the GPU therefore consumed an equal amount of system RAM. Testing showed that Level Zero's zeMemAllocDevice() instead uses SVM/P2P paths with minimal host impact: about 8MB of system RAM versus 4GB for the same 4GB VRAM allocation.

The developer's fix centralizes allocation and replaces sycl::malloc_device() with zeMemAllocDevice() throughout llama.cpp's SYCL backend, keeping full SYCL kernel compatibility while eliminating the memory mirroring. This resolves the crashes and drops back to the login screen, and dispels the misconception that dual-GPU setups require 128GB+ of system RAM.
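The patch itself isn't reproduced here; the following is a minimal sketch of the technique it describes: pulling the native Level Zero handles out of a SYCL queue and allocating through zeMemAllocDevice() instead of sycl::malloc_device(). The helper names alloc_device_ze/free_device_ze are illustrative, not the names used in the actual fix, and it assumes the oneAPI DPC++ compiler with the Level Zero backend.

    // Sketch: device allocation via Level Zero interop instead of
    // sycl::malloc_device(). Allocating in the same ze_context that backs the
    // SYCL context keeps the pointer usable by SYCL kernels submitted to q.
    #include <sycl/sycl.hpp>
    #include <sycl/ext/oneapi/backend/level_zero.hpp>
    #include <level_zero/ze_api.h>

    // Illustrative name; the real patch centralizes this inside the backend.
    void *alloc_device_ze(size_t size, sycl::queue &q) {
        // Native Level Zero handles behind the SYCL context/device.
        auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(
            q.get_context());
        auto ze_dev = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(
            q.get_device());

        ze_device_mem_alloc_desc_t desc{};
        desc.stype = ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC;

        void *ptr = nullptr;
        // Takes the SVM/P2P allocation path rather than DMA-buf/TTM.
        if (zeMemAllocDevice(ze_ctx, &desc, size, /*alignment=*/0,
                             ze_dev, &ptr) != ZE_RESULT_SUCCESS)
            return nullptr;
        return ptr;
    }

    void free_device_ze(void *ptr, sycl::queue &q) {
        auto ze_ctx = sycl::get_native<sycl::backend::ext_oneapi_level_zero>(
            q.get_context());
        zeMemFree(ze_ctx, ptr);
    }

Call sites that previously did ptr = sycl::malloc_device<float>(n, q) would instead go through such a helper; kernels and q.memcpy() keep working because the allocation lives in the same underlying Level Zero context.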

Key Points
  • Dual Intel Arc GPUs with llama.cpp SYCL were mirroring VRAM 1:1 to system RAM via DMA-buf/TTM paths
  • zeMemAllocDevice() uses SVM/P2P paths with roughly 500x less system RAM overhead (8MB vs 4GB for the same 4GB VRAM allocation)
  • Fix enables stable multi-GPU inference without requiring massive system memory upgrades

Why It Matters

Enables cost-effective dual-GPU LLM inference setups without requiring excessive system RAM, making local AI more accessible.