BitCPM-CANN brings 1.58-bit LLM training to Huawei Ascend NPUs
First native ternary training on Ascend NPUs retains up to 97% of full-precision performance with up to 8× less weight memory.
OpenBMB and Huawei have released BitCPM-CANN, the first end-to-end 1.58-bit (ternary) large language model training system designed specifically for Huawei's Ascend NPU platform. By porting its GPU-based pipeline to CANN, MindSpeed, and Megatron-LM, the team trained four ternary models (BitCPM-CANN-0.5B, 1B, 3B, and 8B) using the same architecture and pre-training data as their full-precision MiniCPM4 counterparts. The 1B, 3B, and 8B variants retain 95.7%–97.2% of full-precision performance across 11 benchmarks covering commonsense reasoning, domain knowledge, and mathematics. The 3B model reaches parity with its full-precision counterpart on BBH, and both the 3B and 8B models recover nearly all GSM8K performance. The 0.5B variant lags at 90.1%, a gap attributed largely to limited model capacity rather than to quantization quality.
On the efficiency front, ternary training in BitCPM-CANN adds just 4.5% throughput overhead (148 vs 155 TFLOP/s per NPU), making it practical as a default configuration. At inference, the approach delivers up to an 8× reduction in weight memory (about 6× end-to-end when including scaling factors). This marks a significant milestone for AI sovereignty, as it provides a reusable low-bit training infrastructure outside the CUDA ecosystem. Notably, MiniCPM4 8B, the full-precision baseline, achieves performance comparable to Qwen3-8B (trained on 36 trillion tokens) using only 8 trillion tokens, underscoring the data efficiency of the architecture and pre-training data that the ternary models share.
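The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the throughput numbers come from the article, while the 16-bit baseline weight format and the 2-bit packing of ternary weights are assumptions made here to show where an 8× figure could come from. The function names are hypothetical.

```python
import math

# Information content of one ternary weight in {-1, 0, +1}.
# log2(3) is about 1.585, which is where the "1.58-bit" label comes from.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def throughput_overhead(baseline_tflops: float, ternary_tflops: float) -> float:
    """Fractional slowdown of ternary training relative to the baseline."""
    return (baseline_tflops - ternary_tflops) / baseline_tflops


def weight_memory_reduction(baseline_bits: float, stored_bits: float) -> float:
    """How many times smaller the stored weights become."""
    return baseline_bits / stored_bits


# Figures reported in the article: 155 vs 148 TFLOP/s per NPU.
overhead = throughput_overhead(155.0, 148.0)

# Assumptions (not stated in the article): 16-bit baseline weights,
# and ternary weights packed into 2 bits each at inference.
packed_reduction = weight_memory_reduction(16, 2)

# Theoretical limit if each weight cost exactly log2(3) bits.
ideal_reduction = weight_memory_reduction(16, BITS_PER_TERNARY_WEIGHT)

print(f"bits per ternary weight: {BITS_PER_TERNARY_WEIGHT:.3f}")
print(f"training overhead: {overhead:.1%}")
print(f"2-bit packed reduction: {packed_reduction:.0f}x")
print(f"entropy-limit reduction: {ideal_reduction:.1f}x")
```

Under these assumptions the overhead works out to 4.5% and the packed weight reduction to 8×, matching the reported figures. The reported end-to-end figure of about 6× is lower because scaling factors are also stored. Their exact storage cost is not given, so it is not modeled here.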
- 1.58-bit ternary training on Huawei Ascend NPU for models up to 8B parameters
- The 1B, 3B, and 8B models retain 95.7–97.2% of full-precision performance on 11 reasoning and knowledge benchmarks
- Only 4.5% training throughput overhead, with up to 8× weight memory reduction at inference
Why It Matters
Enables high-performance LLM training on Huawei's domestically produced Ascend NPUs, reducing reliance on CUDA and cutting inference weight memory by up to 8×.