Open Source

BitCPM-CANN brings 1.58-bit LLM training to Huawei Ascend NPUs

First native ternary training on an NPU retains up to 97% of full-precision performance with up to 8× less weight memory.

Deep Dive

OpenBMB and Huawei have released BitCPM-CANN, the first end-to-end 1.58-bit (ternary) large language model training system designed specifically for Huawei's Ascend NPU platform. By porting their GPU-based pipeline to CANN, MindSpeed, and Megatron-LM, the team trained four ternary models (BitCPM-CANN-0.5B, 1B, 3B, and 8B) using the same architecture and pre-training data as their full-precision MiniCPM4 counterparts. The 1B, 3B, and 8B variants retain 95.7%–97.2% of their full-precision counterparts' performance across 11 benchmarks covering commonsense reasoning, domain knowledge, and mathematics. The 3B model reaches parity on BBH, and both the 3B and 8B models recover nearly all GSM8K performance. The 0.5B variant lags at 90.1%, a gap attributed largely to limited model capacity rather than quantization quality.
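For readers new to the term, "ternary" means each weight is stored as one of three values, -1, 0, or +1, together with a shared scale; three states carry log2(3) ≈ 1.58 bits of information, hence "1.58-bit". The sketch below is illustrative only. The article does not describe BitCPM-CANN's quantizer, so the code assumes the absmean scheme popularized by BitNet b1.58, and the function names `ternarize` and `dequantize` are hypothetical.

```python
import math


def ternarize(weights, eps=1e-8):
    """Map floats to codes in {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (as in BitNet b1.58); the quantizer
    actually used by BitCPM-CANN is not specified in the article.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate float weights from ternary codes."""
    return [c * scale for c in codes]


# Toy weight vector, invented for illustration.
weights = [0.42, -0.05, -0.61, 0.08, 0.30, -0.27]
codes, scale = ternarize(weights)
approx = dequantize(codes, scale)

# Information content of a three-state symbol.
bits_per_weight = math.log2(3)

print(codes)                      # [1, 0, -1, 0, 1, -1]
print(round(bits_per_weight, 2))  # 1.58
```

Small weights collapse to 0 and large ones saturate at ±1, which is why training with the quantizer in the loop, rather than quantizing a finished model, matters for keeping accuracy.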

On the efficiency front, BitCPM-CANN's ternary training costs about 4.5% of training throughput (148 vs. 155 TFLOP/s per NPU), making it practical as a default configuration. At inference, the approach delivers up to an 8× reduction in weight memory (about 6× end-to-end when including scaling factors). This marks a significant milestone for AI sovereignty, as it provides a reusable low-bit training infrastructure outside the CUDA ecosystem. Notably, MiniCPM4 8B, the full-precision baseline, achieves performance comparable to Qwen3-8B (trained on 36 trillion tokens) using only 8 trillion tokens, underscoring the data efficiency of the architecture that the ternary models inherit.
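The throughput and memory figures can be checked with a little arithmetic. The sketch below uses the two throughput numbers from the article; everything else is an assumption made for illustration: a 16-bit full-precision baseline, ternary weights packed into 2 bits each (which is what would yield 8×), and a nominal 8 billion parameters for the 8B model. It does not attempt to reproduce the roughly 6× end-to-end figure, since the article gives no breakdown of the scaling-factor overhead.

```python
import math

# Per-NPU training throughput in TFLOP/s, as reported in the article.
baseline_tflops = 155.0
ternary_tflops = 148.0
overhead = (baseline_tflops - ternary_tflops) / baseline_tflops

# Assumptions (not stated in the article): 16-bit baseline weights,
# ternary weights packed at 2 bits each.
FP_BITS = 16
PACKED_BITS = 2
packed_ratio = FP_BITS / PACKED_BITS   # reduction with 2-bit packing
ideal_ratio = FP_BITS / math.log2(3)   # theoretical limit at 1.58 bits


def weight_gib(num_params, bits_per_weight):
    """Weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits_per_weight / 8 / 2**30


params_8b = 8e9  # nominal parameter count, assumed for illustration

print(f"throughput overhead: {overhead:.1%}")                    # 4.5%
print(f"packed reduction: {packed_ratio:.0f}x")                  # 8x
print(f"16-bit weights: {weight_gib(params_8b, FP_BITS):.1f} GiB")
print(f"2-bit weights:  {weight_gib(params_8b, PACKED_BITS):.2f} GiB")
```

Under these assumptions the 8× figure corresponds to 2-bit packing rather than the information-theoretic 1.58 bits, which would allow roughly 10× if weights could be packed perfectly.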

Key Points
  • 1.58-bit ternary training on Huawei Ascend NPU for models up to 8B parameters
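  • 1B, 3B, and 8B models retain 95.7%–97.2% of full-precision performance across 11 benchmarks (commonsense reasoning, domain knowledge, mathematics); the 0.5B model retains 90.1%
  • About 4.5% training throughput overhead, with up to 8× less weight memory at inference (about 6× end-to-end)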

Why It Matters

Enables low-bit LLM training on Huawei's Ascend NPUs with near full-precision quality, reducing reliance on the CUDA ecosystem and cutting inference weight memory by up to 8×.