Developer Tools

llama.cpp b9112 fixes CUDA grid limit, enabling audio clips longer than ~4 seconds

Audio encoders can now process inputs longer than 65535 samples on CUDA without a kernel launch failure.

Deep Dive

The open-source LLM inference framework llama.cpp has released build b9112, which fixes a CUDA launch bug in its im2col and im2col_3d kernels. CUDA caps grid dimension Y at 65535, and the kernels mapped output width (OW) onto that dimension, so launches failed for audio encoders processing raw 16 kHz clips longer than about 4 seconds (65535 / 16000 ≈ 4.1 s). For example, a SEANet encoder on 11-second audio produces an OW of 176000 (11 × 16000), which exceeds the limit and fails with an 'invalid configuration argument' error.
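
The arithmetic behind those figures can be checked with a short sketch. It assumes one output column per input sample (so OW equals the sample count, as in the article's example); the function names are illustrative and do not come from llama.cpp.

```python
MAX_GRIDDIM_Y = 65535     # CUDA's upper bound on gridDim.y
SAMPLE_RATE_HZ = 16_000   # raw audio sample rate cited above


def output_width(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of samples in a clip; assumed here to equal OW (one column per sample)."""
    return round(seconds * sample_rate_hz)


def exceeds_grid_y(ow: int) -> bool:
    """True when OW cannot be used directly as the grid's Y dimension."""
    return ow > MAX_GRIDDIM_Y


# Longest clip (in seconds) whose OW still fits in grid dimension Y.
max_seconds = MAX_GRIDDIM_Y / SAMPLE_RATE_HZ

if __name__ == "__main__":
    ow = output_width(11)
    print(f"11 s clip -> OW = {ow}, exceeds limit: {exceeds_grid_y(ow)}")
    print(f"longest clip that fits: {max_seconds:.2f} s")
```

Under these assumptions, any clip longer than roughly 4.1 seconds pushes OW past the grid limit.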

The fix has two parts: the launch clamps block_nums.y to MIN(OW, MAX_GRIDDIM_Y), and the kernel loops over the remaining output positions with a stride, the same pattern already used for the Z axis. Output stays bit-identical for clips within the 65535-sample limit, and longer audio is now processed correctly. The patch was tested on NVIDIA T4 and Jetson Orin hardware. The release also ships build artifacts for 30 platforms, spanning macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x), Windows (x64, arm64, CUDA, Vulkan), Android (arm64), and openEuler.
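
The clamp-and-stride idea can be illustrated with a host-side emulation of the grid's Y dimension. This is a minimal sketch, not the llama.cpp kernel: launch_im2col_like is a hypothetical name, the outer loop stands in for blockIdx.y, and the ValueError merely imitates the CUDA launch error.

```python
MAX_GRIDDIM_Y = 65535  # CUDA's upper bound on gridDim.y


def launch_im2col_like(ow: int, clamp: bool = True) -> list[int]:
    """Emulate the Y dimension of a launch; return how often each output column is visited."""
    grid_y = min(ow, MAX_GRIDDIM_Y) if clamp else ow
    if grid_y > MAX_GRIDDIM_Y:
        # Stand-in for the launch failure an oversized grid produces.
        raise ValueError("invalid configuration argument")

    visits = [0] * ow
    for block_y in range(grid_y):               # one iteration per emulated blockIdx.y
        for iow in range(block_y, ow, grid_y):  # stride loop: iow += gridDim.y
            visits[iow] += 1
    return visits


if __name__ == "__main__":
    visits = launch_im2col_like(176000)
    print("every column visited exactly once:", all(v == 1 for v in visits))
```

When OW is at or below the limit, each emulated block runs the inner loop exactly once, which matches the pre-fix behavior for short clips. For larger OW, each block handles several columns spaced gridDim.y apart, so every column is still covered exactly once.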

Key Points
  • Fixes CUDA grid Y limit (65535) in im2col and im2col_3d kernels for long audio inputs
  • Enables SEANet encoders to run on 11-second, 16 kHz audio (OW=176000); tested on NVIDIA T4 and Jetson Orin
  • Bit-identical output for short clips; release ships build artifacts for 30 platforms, including macOS, Linux, Windows, and Android

Why It Matters

Removes an effective cap of about 4 seconds of raw 16 kHz audio for affected audio encoders on CUDA, so longer clips can be processed without launch failures.