Developer Tools

Llama.cpp b9112 fixes CUDA audio limit, enabling 11s+ clips

llama.cpp Releases May 12, 2026

⚡ Audio models can now process clips longer than 65535 samples on CUDA without a kernel launch failure.

Deep Dive

The open-source LLM inference framework llama.cpp has released version b9112, which fixes a CUDA limitation in its im2col and im2col_3d kernels. CUDA caps grid dimension Y at 65535, and the kernels used the output width (OW) as that dimension, so launches failed for audio encoders processing raw 16 kHz clips longer than about 4 seconds (65535 / 16000 ≈ 4.1 s). For example, a SEANet encoder on 11-second audio reaches an OW of 176000, which exceeds the limit and fails with an 'invalid configuration argument' error.
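The arithmetic behind those numbers can be checked with a few lines of Python. This is only a sketch: it assumes OW equals the raw sample count, as in the 11-second example, and the helper names are made up for illustration rather than taken from llama.cpp.

```python
MAX_GRIDDIM_Y = 65535      # CUDA's cap on grid dimension Y
SAMPLE_RATE_HZ = 16_000    # 16 kHz audio, as in the example above


def output_width(seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Sample count of a raw clip (assumed here to equal OW)."""
    return round(seconds * sample_rate_hz)


def exceeds_grid_y(ow):
    """True if using OW directly as grid dimension Y would be rejected."""
    return ow > MAX_GRIDDIM_Y


# Longest clip, in seconds, that still fits under the cap at 16 kHz.
longest_safe_seconds = MAX_GRIDDIM_Y / SAMPLE_RATE_HZ

if __name__ == "__main__":
    for seconds in (4, 11):
        ow = output_width(seconds)
        print(seconds, "s -> OW", ow, "over limit:", exceeds_grid_y(ow))
```

A 4-second clip gives an OW of 64000 and fits; an 11-second clip gives 176000 and does not.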

The fix has two parts: the launch clamps block_nums.y to MIN(OW, MAX_GRIDDIM_Y), and the kernel loops over the remaining output positions with the same stride pattern already used for the Z axis. Output stays bit-identical for clips that fit within the 65535 limit, and longer audio is now processed correctly. The patch was tested on NVIDIA T4 and Jetson Orin hardware. The release also ships build artifacts for 30 platforms spanning macOS (Apple Silicon and Intel), Linux (x64, arm64, s390x), Windows (x64, arm64, CUDA, Vulkan), Android (arm64), and openEuler.
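The clamp-and-stride idea can be illustrated with a CPU-side simulation in Python. This is not the actual kernel: the function names are hypothetical, and striding by the clamped grid size is an assumption about the loop's form. It only shows that a clamped grid plus a strided loop visits every output position exactly once.

```python
MAX_GRIDDIM_Y = 65535  # CUDA's cap on grid dimension Y


def launch_grid_y(ow):
    """Clamped Y grid size, mirroring MIN(OW, MAX_GRIDDIM_Y)."""
    return min(ow, MAX_GRIDDIM_Y)


def simulate_y_coverage(ow):
    """Count how often each output position is visited by the strided loop."""
    grid_y = launch_grid_y(ow)
    visits = [0] * ow
    for block_y in range(grid_y):               # stands in for blockIdx.y
        for iow in range(block_y, ow, grid_y):  # strided loop inside the kernel
            visits[iow] += 1
    return visits


def iterations_per_block(ow, block_y):
    """Number of loop iterations a given Y block performs."""
    return len(range(block_y, ow, launch_grid_y(ow)))


if __name__ == "__main__":
    visits = simulate_y_coverage(176_000)
    print("all positions visited once:", all(v == 1 for v in visits))
    print("iterations for block 0:", iterations_per_block(176_000, 0))
```

When OW fits within the limit, each block runs exactly one iteration at the same index it handled before the change, which is consistent with the bit-identical output reported for short clips.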

Key Points

Fixes CUDA grid Y limit (65535) in im2col and im2col_3d kernels for long audio inputs
Enables SEANet encoders on 11-second, 16 kHz audio (OW=176000) on T4 and Jetson Orin
Bit-identical output for short clips; build artifacts for 30 platforms including macOS, Linux, Windows, and Android

Why It Matters

Removes a hard cap on audio length for CUDA-based LLM audio processing, enabling longer context inputs.
