Open Source

r/LocalLLaMA May 22, 2026

⚡ A 1.58-bit quantized model achieves 4x memory savings on Chinese-made AI chips.

Deep Dive

According to a post on r/LocalLLaMA, BitCPM-CANN models have been tested on the Huawei Ascend 910B accelerator.

Key Points

BitCPM-CANN uses 1.58-bit ternary quantization (values -1, 0, +1) for 4x memory reduction over FP16.
Tested on Huawei Ascend 910B, achieving inference speeds comparable to traditional FP16 models.
Retains over 96% accuracy on NLP benchmarks despite aggressive compression.
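
For readers unfamiliar with ternary quantization, here is a minimal sketch of the idea in plain Python. The "1.58-bit" label comes from log2(3) ≈ 1.585, the information content of a three-valued weight. The absmean scaling rule below follows the published BitNet b1.58 approach and is an assumption: the post does not say which scheme BitCPM-CANN uses. The function names and the 2-bit packing layout are hypothetical and for illustration only.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Map floats to {-1, 0, +1} using an absmean scale.

    Assumption: BitNet b1.58-style scaling; not confirmed for BitCPM-CANN.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Approximate reconstruction of the original weights."""
    return [v * scale for v in q]


def pack_2bit(q):
    """Pack four ternary values per byte (hypothetical layout, 2 bits each)."""
    out = bytearray()
    for i in range(0, len(q), 4):
        byte = 0
        for j, v in enumerate(q[i:i + 4]):
            byte |= (v + 1) << (2 * j)  # store -1, 0, +1 as 0, 1, 2
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed, n):
    """Inverse of pack_2bit; n is the original number of weights."""
    q = []
    for byte in packed:
        for j in range(4):
            if len(q) < n:
                q.append(((byte >> (2 * j)) & 0b11) - 1)
    return q


weights = [0.42, -0.03, -0.77, 0.10, 0.58, -0.36, 0.01, -0.91]
q, scale = ternary_quantize(weights)
packed = pack_2bit(q)

fp16_bytes = 2 * len(weights)            # 16 bits per weight
ratio_2bit = fp16_bytes / len(packed)    # practical 2-bit packing
ratio_ideal = 16 / math.log2(3)          # information-theoretic limit

print("ternary weights:", q)
print("2-bit packing ratio vs FP16:", ratio_2bit)
print("ideal ratio vs FP16:", round(ratio_ideal, 2))
```

The per-weight ratios computed here (8x with 2-bit packing, about 10x at the theoretical limit) apply only to the ternary weights themselves and ignore the stored scale. The post's 4x figure is reported as-is and is not derived by this sketch.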

Extreme quantization lets LLMs run on more affordable chips, cutting hardware costs and reducing dependence on GPUs.