llama.cpp b9145 slashes multi-GPU RAM usage with Level Zero fix

llama.cpp Releases May 14, 2026

⚡ New release cuts system RAM usage from 60 GiB to about 6.7 GiB on dual Intel Arc GPUs.

Deep Dive

llama.cpp's b9145 release fixes a critical memory issue in the SYCL backend on multi-GPU Intel Arc systems. The root cause was sycl::malloc_device, which triggered a DMA-buf/TTM path that mirrored every VRAM allocation 1:1 in system RAM. On a dual Intel Arc Pro B70 system (64 GB VRAM, 64 GB RAM), loading a 15.6 GiB model consumed 60 GiB of system RAM and caused OOM crashes. The fix switches to zeMemAllocDevice, which uses the SVM/P2P path with no host staging and reduces system RAM usage to about 6.7 GiB with no performance regression.

The implementation includes an automatic fallback to the original SYCL allocation path when Level Zero interop is unavailable, a CMake build flag (GGML_SYCL_SUPPORT_LEVEL_ZERO), and a runtime environment variable (GGML_SYCL_ENABLE_LEVEL_ZERO). The code was refined through review feedback: try/catch blocks were removed, device type checks were added, helpers were deduplicated, and the Level Zero path was restricted to dGPU-to-dGPU transfers. CI now installs the Level Zero SDK for both Ubuntu and Windows builds. Two bugs were also fixed: CPU devices are now filtered out of the Level Zero backend checks, and allocation paths that had been missed (temporary buffers for tensor reordering) are now routed through the new functions. Development was assisted by Claude Opus 4.6 and co-authored with community contributor @arthw.
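The fallback behaviour described above can be pictured roughly as follows. This is a minimal sketch of the control flow only, not the code from the release: the two allocator functions are stand-ins built on std::malloc (where the real backend would call zeMemAllocDevice and sycl::malloc_device), and the names device_alloc, DeviceKind, AllocPath, and Allocation are hypothetical. The source names the GGML_SYCL_ENABLE_LEVEL_ZERO variable but not its accepted values or default, so the "enabled unless set to 0" rule here is an assumption.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical types for illustration; none of these names come from llama.cpp.
enum class DeviceKind { CPU, GPU };
enum class AllocPath { LevelZero, Sycl };

struct Allocation {
    void*     ptr;   // allocated memory (host memory in this sketch)
    AllocPath path;  // which path produced it
};

// ASSUMPTION: the toggle counts as enabled unless the variable is set to "0".
bool level_zero_enabled_at_runtime() {
    const char* v = std::getenv("GGML_SYCL_ENABLE_LEVEL_ZERO");
    return v == nullptr || std::strcmp(v, "0") != 0;
}

// Stand-in for zeMemAllocDevice: returns nullptr when interop is unavailable.
void* level_zero_alloc_stub(std::size_t size, bool interop_available) {
    return interop_available ? std::malloc(size) : nullptr;
}

// Stand-in for sycl::malloc_device, the original allocation path.
void* sycl_alloc_stub(std::size_t size) {
    return std::malloc(size);
}

// Try the Level Zero path only when it was compiled in, is enabled at
// runtime, and the device is a GPU; otherwise, or on failure, fall back.
// built_with_level_zero models the GGML_SYCL_SUPPORT_LEVEL_ZERO build option.
Allocation device_alloc(std::size_t size, DeviceKind kind,
                        bool built_with_level_zero, bool interop_available) {
    if (built_with_level_zero && level_zero_enabled_at_runtime() &&
        kind == DeviceKind::GPU) {
        if (void* p = level_zero_alloc_stub(size, interop_available)) {
            return {p, AllocPath::LevelZero};
        }
    }
    return {sycl_alloc_stub(size), AllocPath::Sycl};
}
```

Calling device_alloc(1024, DeviceKind::GPU, true, true) with the variable unset takes the Level Zero branch in this sketch, while a CPU device, a disabled toggle, or a failed Level Zero allocation all end up on the original path. The fallback is signalled by a null return rather than an exception, which matches the review note about removing try/catch only in spirit.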

Key Points

Replaces sycl::malloc_device with zeMemAllocDevice, cutting system RAM usage from 60 GiB to about 6.7 GiB on dual Intel Arc Pro B70 for a 15.6 GiB model
Includes fallback to original SYCL path, CMake flag (GGML_SYCL_SUPPORT_LEVEL_ZERO), and runtime env var (GGML_SYCL_ENABLE_LEVEL_ZERO)
CI now builds and tests Level Zero path on Ubuntu and Windows; development assisted by Claude Opus 4.6

Why It Matters

Enables efficient multi-GPU LLM inference on Intel hardware without excessive system RAM consumption, unlocking larger local models.
