b8956
New operators and fused kernels boost performance on Ascend NPUs.
The llama.cpp project's b8956 release brings substantial upgrades to the CANN backend, which enables large language model inference on Huawei's Ascend NPUs. New operators include GGML_OP_SET, CUMSUM, FILL, DIAG, TRI (with lower and upper variants), SOLVE_TRI, and SOFTPLUS, each implemented via Ascend Computing Language (ACL) operator functions such as aclnnInplaceCopy and aclnnCumsum. These additions expand the range of models and operations that can run efficiently on Ascend hardware.
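To make two of the new operators concrete, here is a plain-CPU reference sketch in C++. It only illustrates the usual semantics (CUMSUM as a running sum along a row; softplus(x) = log(1 + e^x)); the actual CANN implementations dispatch to ACL kernels such as aclnnCumsum.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// CPU reference only: the CANN backend maps these ops to ACL kernels
// (e.g. aclnnCumsum); this just shows what the operators compute.

// CUMSUM: running sum along a row.
std::vector<float> cumsum(const std::vector<float> & row) {
    std::vector<float> out(row.size());
    float acc = 0.0f;
    for (size_t i = 0; i < row.size(); ++i) {
        acc += row[i];
        out[i] = acc;
    }
    return out;
}

// SOFTPLUS: log(1 + exp(x)), written in the numerically stable form
// max(x, 0) + log1p(exp(-|x|)) to avoid overflow for large |x|.
float softplus(float x) {
    return std::fmax(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}

int main() {
    const std::vector<float> row = {1.0f, 2.0f, 3.0f, 4.0f};
    for (float v : cumsum(row)) std::printf("%.1f ", v);  // 1.0 3.0 6.0 10.0
    std::printf("\nsoftplus(0) = %.4f\n", softplus(0.0f)); // log(2) ~ 0.6931
    return 0;
}
```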
Optimizations in this release focus on reducing kernel launch overhead and improving numerical accuracy. GLU variants (SwiGLU, GeGLU, etc.) are now fused into single aclnnSwiGlu/aclnnGeGluV3 calls when applicable. CROSS_ENTROPY_LOSS, previously requiring five separate kernels, is now computed with a single aclnnSoftmaxCrossEntropyWithLogits call. A critical bug fix in the ACL graph cache resolves an issue where F16 and BF16 tensors (which share the same 2-byte element size, nb[0] = 2) could incorrectly share cached graphs, leading to errors of up to 679 in output values. The fix ensures that op_params are fully compared, preventing incorrect cache hits.
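The caching pitfall is easiest to see in a simplified sketch. The struct and function names below are hypothetical stand-ins for the CANN backend's ACL graph cache: a key built from element counts and byte strides alone cannot tell F16 from BF16, since both types have nb[0] == 2, so the fixed match also compares the dtype and the full op_params array.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical, simplified model of a cached-graph key; the real logic
// lives in the CANN backend's ACL graph cache.
constexpr int MAX_DIMS      = 4;
constexpr int MAX_OP_PARAMS = 16; // int32 slots, sized like ggml's op_params

struct NodeKey {
    int     type;                     // tensor dtype (F16, BF16, ...)
    int64_t ne[MAX_DIMS];             // elements per dimension
    size_t  nb[MAX_DIMS];             // byte strides; nb[0] == 2 for BOTH F16 and BF16
    int32_t op_params[MAX_OP_PARAMS]; // per-op parameters
};

// Buggy match: byte strides cannot distinguish F16 from BF16, and
// op_params are ignored, so distinct nodes can hit the same cached graph.
bool match_buggy(const NodeKey & a, const NodeKey & b) {
    return std::memcmp(a.ne, b.ne, sizeof(a.ne)) == 0 &&
           std::memcmp(a.nb, b.nb, sizeof(a.nb)) == 0;
}

// Fixed match: also compare the dtype and the FULL op_params array,
// which is what the b8956 fix enforces.
bool match_fixed(const NodeKey & a, const NodeKey & b) {
    return a.type == b.type &&
           std::memcmp(a.ne, b.ne, sizeof(a.ne)) == 0 &&
           std::memcmp(a.nb, b.nb, sizeof(a.nb)) == 0 &&
           std::memcmp(a.op_params, b.op_params, sizeof(a.op_params)) == 0;
}

int main() {
    NodeKey f16  = {/*type=*/1, {32, 1, 1, 1}, {2, 64, 64, 64}, {0}};
    NodeKey bf16 = f16;
    bf16.type = 2; // different dtype, identical shape and strides
    std::printf("buggy: %d  fixed: %d\n",
                match_buggy(f16, bf16), match_fixed(f16, bf16)); // buggy: 1  fixed: 0
    return 0;
}
```

Comparing the whole op_params buffer is cheap next to the alternative of silently replaying the wrong cached graph.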
- New operators: GGML_OP_SET, CUMSUM, FILL, DIAG, TRI, SOLVE_TRI, and SOFTPLUS
- GLU fusion: SwiGLU/GeGLU now use single aclnnSwiGlu/aclnnGeGluV3 calls (see the sketch after this list)
- Bug fix: ACL graph cache now correctly distinguishes F16/BF16 tensors, preventing output errors of up to 679
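As referenced above, the payoff of GLU fusion is visible from SwiGLU's definition alone: one half of the input row is passed through SiLU and multiplied elementwise with the other half, which an unfused backend runs as separate activation and multiply kernels. The single loop below stands in for the single fused launch that a call like aclnnSwiGlu performs; it illustrates the math only, and the [gate | up] layout is an assumption for this sketch, not the ACL code.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// SiLU (swish): x * sigmoid(x).
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Fused SwiGLU over a row laid out as [gate | up]:
// out[i] = silu(gate[i]) * up[i]. One loop stands in for one fused
// kernel launch; unfused backends need an activation kernel plus a
// multiply kernel.
std::vector<float> swiglu(const std::vector<float> & row) {
    const size_t half = row.size() / 2;
    std::vector<float> out(half);
    for (size_t i = 0; i < half; ++i) {
        out[i] = silu(row[i]) * row[half + i];
    }
    return out;
}

int main() {
    const std::vector<float> row = {0.5f, -1.0f, 2.0f, 3.0f}; // gate = {0.5, -1}, up = {2, 3}
    for (float v : swiglu(row)) std::printf("%.4f ", v);
    std::printf("\n");
    return 0;
}
```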
Why It Matters
Broader operator coverage and faster kernels expand efficient LLM inference on Ascend NPUs, critical for cost-effective deployment in China and other markets.