b8956
New operators and fused kernels boost performance on Ascend NPUs.
The llama.cpp project's b8956 release brings substantial upgrades to the CANN backend, which enables large language model inference on Huawei's Ascend NPUs. New operators include GGML_OP_SET, CUMSUM, FILL, DIAG, TRI (with lower and upper variants), SOLVE_TRI, and SOFTPLUS, each implemented via Ascend Computing Language (ACL) operator functions such as aclnnInplaceCopy and aclnnCumsum. These additions expand the range of models and operations that can run efficiently on Ascend hardware.
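To make two of the new operators concrete, here is a plain-CPU reference sketch in C++. It only illustrates the usual semantics (CUMSUM as a running sum along a row; softplus(x) = log(1 + e^x)); the actual CANN implementations dispatch to ACL kernels such as aclnnCumsum.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// CPU reference only: the CANN backend maps these ops to ACL kernels
// (e.g. aclnnCumsum); this just shows what the operators compute.

// CUMSUM: running sum along a row.
std::vector<float> cumsum(const std::vector<float> & row) {
    std::vector<float> out(row.size());
    float acc = 0.0f;
    for (size_t i = 0; i < row.size(); ++i) {
        acc += row[i];
        out[i] = acc;
    }
    return out;
}

// SOFTPLUS: log(1 + exp(x)), written in the numerically stable form
// max(x, 0) + log1p(exp(-|x|)) to avoid overflow for large |x|.
float softplus(float x) {
    return std::fmax(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
}

int main() {
    const std::vector<float> row = {1.0f, 2.0f, 3.0f, 4.0f};
    for (float v : cumsum(row)) std::printf("%.1f ", v);  // 1.0 3.0 6.0 10.0
    std::printf("\nsoftplus(0) = %.4f\n", softplus(0.0f)); // log(2) ~ 0.6931
    return 0;
}
```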
Optimizations in this release focus on reducing kernel launch overhead and improving numerical accuracy. GLU variants (SwiGLU, GeGLU, etc.) are now fused into single aclnnSwiGlu/aclnnGeGluV3 calls when applicable. CROSS_ENTROPY_LOSS, previously requiring five separate kernels, is now computed with a single aclnnSoftmaxCrossEntropyWithLogits call. A critical bug fix in the ACL graph cache resolves an issue where F16 and BF16 tensors (which share the same 2-byte element size, nb[0] = 2) could incorrectly share cached graphs, leading to errors of up to 679 in output values. The fix ensures that op_params are fully compared, preventing incorrect cache hits.
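The caching pitfall is easiest to see in a simplified sketch. The struct and function names below are hypothetical stand-ins for the CANN backend's ACL graph cache: a key built from element counts and byte strides alone cannot tell F16 from BF16, since both types have nb[0] == 2, so the fixed match also compares the dtype and the full op_params array.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical, simplified model of a cached-graph key; the real logic
// lives in the CANN backend's ACL graph cache.
constexpr int MAX_DIMS      = 4;
constexpr int MAX_OP_PARAMS = 16; // int32 slots, sized like ggml's op_params

struct NodeKey {
    int     type;                     // tensor dtype (F16, BF16, ...)
    int64_t ne[MAX_DIMS];             // elements per dimension
    size_t  nb[MAX_DIMS];             // byte strides; nb[0] == 2 for BOTH F16 and BF16
    int32_t op_params[MAX_OP_PARAMS]; // per-op parameters
};

// Buggy match: byte strides cannot distinguish F16 from BF16, and
// op_params are ignored, so distinct nodes can hit the same cached graph.
bool match_buggy(const NodeKey & a, const NodeKey & b) {
    return std::memcmp(a.ne, b.ne, sizeof(a.ne)) == 0 &&
           std::memcmp(a.nb, b.nb, sizeof(a.nb)) == 0;
}

// Fixed match: also compare the dtype and the FULL op_params array,
// which is what the b8956 fix enforces.
bool match_fixed(const NodeKey & a, const NodeKey & b) {
    return a.type == b.type &&
           std::memcmp(a.ne, b.ne, sizeof(a.ne)) == 0 &&
           std::memcmp(a.nb, b.nb, sizeof(a.nb)) == 0 &&
           std::memcmp(a.op_params, b.op_params, sizeof(a.op_params)) == 0;
}

int main() {
    NodeKey f16  = {/*type=*/1, {32, 1, 1, 1}, {2, 64, 64, 64}, {0}};
    NodeKey bf16 = f16;
    bf16.type = 2; // different dtype, identical shape and strides
    std::printf("buggy: %d  fixed: %d\n",
                match_buggy(f16, bf16), match_fixed(f16, bf16)); // buggy: 1  fixed: 0
    return 0;
}
```

Comparing the whole op_params buffer is cheap next to the alternative of silently replaying the wrong cached graph.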
- New operators: GGML_OP_SET, CUMSUM, FILL, DIAG, TRI, SOLVE_TRI, and SOFTPLUS
- GLU fusion: SwiGLU/GeGLU now use single aclnnSwiGlu/aclnnGeGluV3 calls (see the sketch after this list)
- Bug fix: ACL graph cache now correctly distinguishes F16/BF16 tensors, preventing output errors of up to 679
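As referenced above, the payoff of GLU fusion is visible from SwiGLU's definition alone: one half of the input row is passed through SiLU and multiplied elementwise with the other half, which an unfused backend runs as separate activation and multiply kernels. The single loop below stands in for the single fused launch that a call like aclnnSwiGlu performs; it illustrates the math only, and the [gate | up] layout is an assumption for this sketch, not the ACL code.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// SiLU (swish): x * sigmoid(x).
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Fused SwiGLU over a row laid out as [gate | up]:
// out[i] = silu(gate[i]) * up[i]. One loop stands in for one fused
// kernel launch; unfused backends need an activation kernel plus a
// multiply kernel.
std::vector<float> swiglu(const std::vector<float> & row) {
    const size_t half = row.size() / 2;
    std::vector<float> out(half);
    for (size_t i = 0; i < half; ++i) {
        out[i] = silu(row[i]) * row[half + i];
    }
    return out;
}

int main() {
    const std::vector<float> row = {0.5f, -1.0f, 2.0f, 3.0f}; // gate = {0.5, -1}, up = {2, 3}
    for (float v : swiglu(row)) std::printf("%.4f ", v);
    std::printf("\n");
    return 0;
}
```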
Why It Matters
Broader operator coverage and faster kernels expand efficient LLM inference on Ascend NPUs, critical for cost-effective deployment in China and other markets.