llama.cpp b9145 slashes multi-GPU RAM usage with Level Zero fix
The new release cuts system RAM usage from 60 GiB to about 6.7 GiB on dual Intel Arc GPUs.
llama.cpp's b9145 release addresses a critical SYCL backend memory issue on multi-GPU Intel Arc systems. The root cause was sycl::malloc_device, which triggered a DMA-buf/TTM path that mirrored every VRAM allocation 1:1 in system RAM. On a dual Intel Arc Pro B70 system (64GB VRAM, 64GB RAM), loading a 15.6 GiB model consumed 60 GiB of system RAM and caused OOM crashes. The fix switches to zeMemAllocDevice, which uses the SVM/P2P path with no host staging. System RAM usage drops to about 6.7 GiB with no performance regression.
The implementation includes an automatic fallback to the original SYCL allocation path if Level Zero interop is unavailable, a CMake build flag (GGML_SYCL_SUPPORT_LEVEL_ZERO), and a runtime toggle (GGML_SYCL_ENABLE_LEVEL_ZERO). Review feedback refined the code: the try/catch was removed, device type checks were added, helpers were deduplicated, and the Level Zero path was restricted to dGPU-to-dGPU transfers. CI now installs the Level Zero SDK for both Ubuntu and Windows builds. Two bugs were also fixed: CPU devices are now filtered out of the Level Zero backend checks, and allocation paths that had been missed (temporary buffers for tensor reordering) are now routed through the new functions. Development was assisted by Claude Opus 4.6 and co-authored with community contributor @arthw.
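The fallback behavior described above can be pictured as a small decision function. The following is a minimal C++ sketch, not the code from the release: the types DeviceInfo, DeviceType, and AllocPath and the functions toggle_enabled and choose_alloc_path are hypothetical names invented for illustration, and the assumption that the runtime toggle defaults to on and is disabled by the value "0" is not confirmed by the release notes. Only the environment variable name GGML_SYCL_ENABLE_LEVEL_ZERO comes from the release.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for illustration; these are not llama.cpp types.
enum class DeviceType { CPU, GPU };
enum class AllocPath { LevelZero, Sycl };

struct DeviceInfo {
    DeviceType type;          // kind of device reported by the runtime
    bool level_zero_backend;  // true if the device is exposed through Level Zero
};

// Interprets the value of the runtime toggle.
// Assumption: unset means enabled, and only the literal "0" disables it.
bool toggle_enabled(const char* value) {
    return value == nullptr || std::strcmp(value, "0") != 0;
}

// Reads the toggle named in the release from the environment.
bool level_zero_enabled_from_env() {
    return toggle_enabled(std::getenv("GGML_SYCL_ENABLE_LEVEL_ZERO"));
}

// Picks the Level Zero allocation path only when every precondition holds;
// any failed check falls back to the original SYCL allocation path.
AllocPath choose_alloc_path(const DeviceInfo& dev,
                            bool compiled_with_level_zero,
                            bool runtime_enabled) {
    if (!compiled_with_level_zero) return AllocPath::Sycl;   // build flag off
    if (!runtime_enabled) return AllocPath::Sycl;            // runtime toggle off
    if (dev.type != DeviceType::GPU) return AllocPath::Sycl; // CPU devices are filtered out
    if (!dev.level_zero_backend) return AllocPath::Sycl;     // no Level Zero interop
    return AllocPath::LevelZero;
}
```

In this sketch, a caller would evaluate choose_alloc_path(dev, built_with_flag, level_zero_enabled_from_env()) once per device and then dispatch to either a zeMemAllocDevice-based allocator or sycl::malloc_device. Keeping the decision in one pure function mirrors the release's point that missed allocation paths had to be routed through the same helpers.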
- Replaces sycl::malloc_device with zeMemAllocDevice, cutting system RAM usage from 60 GiB to about 6.7 GiB on dual Intel Arc Pro B70 for a 15.6 GiB model
- Includes fallback to original SYCL path, CMake flag (GGML_SYCL_SUPPORT_LEVEL_ZERO), and runtime env var (GGML_SYCL_ENABLE_LEVEL_ZERO)
- CI now installs the Level Zero SDK for Ubuntu and Windows builds; development was assisted by Claude Opus 4.6
Why It Matters
The fix enables multi-GPU LLM inference on Intel hardware without mirroring VRAM allocations in system RAM, which makes larger local models practical on machines with modest host memory.