b8979
New CUDA kernel fusion speeds up SSM inference on NVIDIA GPUs
Deep Dive
ggml-org's llama.cpp release b8979 (April 29) fuses the SSM_CONV, ADD (bias), and SILU operations into a single CUDA kernel, so the intermediate results no longer round-trip through GPU memory between three separate launches. The release ships prebuilt binaries for macOS (Apple Silicon, Intel, iOS), Linux (CPU, Vulkan, ROCm, OpenVINO, SYCL), Android arm64, and Windows (CPU, CUDA 12/13, Vulkan, SYCL, HIP).
Key Points
- Fuses SSM_CONV, ADD (bias), and SILU into a single CUDA kernel
- Reduces memory bandwidth and kernel launch overhead for SSM models
- Supports macOS, Linux, Android, Windows, and openEuler with prebuilt binaries
Why It Matters
The fused kernel delivers faster inference for state-space models on NVIDIA GPUs, while the broad set of prebuilt binaries keeps local AI deployment straightforward across platforms.