Developer Tools

b8603

The update resolves data corruption when tools like Ollama write to the same tensor from multiple threads.

Deep Dive

The open-source project llama.cpp, maintained by ggml-org, has released a significant update (commit b8603) that patches critical race conditions in its CANN backend. The CANN backend enables llama.cpp to run on Huawei's Ascend AI processors. The bugs surfaced when AI serving tools like Ollama performed multi-threaded tensor writes, leading to three specific failures: corrupt data from per-chunk quantization transforms, incorrect operations from incomplete tensor data during weight conversion, and unprotected concurrent access to a global workspace array.
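The third failure mode, unprotected concurrent access to a global workspace array, has a classic remedy: serialize access with a mutex. The sketch below illustrates only the general pattern; the identifiers (`g_workspace`, `with_workspace`) are hypothetical and not the actual llama.cpp CANN backend code.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative only: serialize access to a shared, growable workspace
// buffer so concurrent tensor writes cannot interleave. Names are
// hypothetical, not the real backend identifiers.
static std::vector<std::byte> g_workspace;
static std::mutex             g_workspace_mutex;

// Run `fn` with exclusive access to a workspace of at least `size` bytes.
// The lock is held for the duration of `fn`, so callers must not stash
// the pointer past the call.
template <typename Fn>
void with_workspace(size_t size, Fn fn) {
    std::lock_guard<std::mutex> lock(g_workspace_mutex);
    if (g_workspace.size() < size) {
        g_workspace.resize(size);
    }
    fn(g_workspace.data());  // exclusive while the lock is held
}
```

Resizing inside the lock matters: an unguarded `resize` can reallocate the buffer out from under another thread mid-write, which is exactly the kind of corruption the patch targets.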

The fix centers on a new 'TensorSetTracker' system that tracks write progress for each tensor. Instead of processing data chunk-by-chunk as it arrives from different threads, the tracker accumulates all chunks for a given tensor. It defers critical operations, such as quantization format transforms and ND-to-NZ weight conversions, until the entire tensor is complete, then performs a single, correct operation. Combined with added mutex protection for shared resources, this eliminates the data corruption. The update also fixes minor bugs in the L2_NORM operation and tightens the ACL graph cache's matching logic to prevent incorrect graph reuse, improving overall stability for models running on Ascend hardware.
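The accumulate-then-transform idea can be sketched as follows. This is a minimal illustration of the described behavior, not the actual llama.cpp implementation; the class layout, method names, and the assumption that chunks never overlap are all mine.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hedged sketch of the deferred-transform pattern: instead of transforming
// each chunk as it arrives (which corrupts per-tensor quantization layouts),
// stage chunks into a buffer and signal completion only when every byte has
// landed, so the transform runs exactly once on the full tensor.
class TensorSetTracker {
public:
    explicit TensorSetTracker(size_t total_bytes)
        : staging_(total_bytes), received_(0) {}

    // Record one chunk written at `offset`; returns true once the tensor
    // is complete and the deferred transform may run.
    bool on_chunk(const void *data, size_t offset, size_t size) {
        std::memcpy(staging_.data() + offset, data, size);
        received_ += size;  // assumes chunks never overlap
        return received_ == staging_.size();
    }

    const std::vector<std::byte> &data() const { return staging_; }

private:
    std::vector<std::byte> staging_;
    size_t received_;
};
```

A caller would invoke the quantization transform or ND-to-NZ conversion only when `on_chunk` returns true, which is the single-pass guarantee the update describes.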

Key Points
  • Fixes three race conditions in the CANN backend that caused data corruption during multi-threaded tensor writes from tools like Ollama.
  • Introduces a 'TensorSetTracker' to defer quantization transforms and weight conversions until all data chunks are received, ensuring correctness.
  • Patches the ACL graph cache to properly compare operation parameters, preventing incorrect cache hits for ops like POOL_2D and CPY.
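The cache fix in the last point amounts to widening the cache key. A minimal sketch of the idea, with hypothetical field names rather than the real CANN backend structures:

```cpp
#include <array>
#include <cstdint>

// Illustrative sketch: a graph-cache key that compares the operation's
// parameters, not just its type and shapes, so two POOL_2D ops with
// different kernel/stride/padding settings cannot hit the same cached
// graph. All field names here are assumptions.
struct GraphCacheKey {
    int32_t op_type;                   // e.g. POOL_2D, CPY
    std::array<int64_t, 4> shape;      // tensor dimensions
    std::array<int32_t, 8> op_params;  // kernel size, stride, padding, ...

    bool operator==(const GraphCacheKey &other) const {
        return op_type == other.op_type &&
               shape == other.shape &&
               op_params == other.op_params;  // the comparison the fix adds
    }
};
```

Before the patch, a key that ignored `op_params` could return a cached graph built for a different pooling configuration, producing silently wrong results rather than a crash.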

Why It Matters

Ensures stability for quantized Llama models running on enterprise Huawei Ascend hardware, a key alternative to NVIDIA GPUs.