Huawei's Ascend-RaBitQ speeds up billion-scale vector search by over 100x
Combines NPU coarse ranking with CPU fine ranking for up to 4.6x higher throughput than the fastest CPU implementation.
Huawei researchers have published Ascend-RaBitQ, a heterogeneous NPU-CPU system that accelerates billion-scale vector similarity search by more than 100x over a mathematically equivalent CPU baseline. The key innovation is a three-stage pipeline: coarse ranking on 1-bit quantized vectors runs on the NPU's AI Cores, which offer high compute density; top-K selection runs on the on-device AI CPU; and fine re-ranking with full-precision vectors runs on the host CPU. This decoupling lets each stage use the hardware best suited to it, easing the long-standing trade-off between accuracy, memory footprint, and performance.
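The three stages can be illustrated with a minimal NumPy sketch that runs entirely on the CPU. It is not the Ascend-RaBitQ implementation: it assumes unit-norm vectors, replaces the RaBitQ distance estimator with a plain sign-bit inner product, and skips the IVF index. All function names are hypothetical.

```python
import numpy as np

def random_rotation(dim, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def build_index(base, rotation):
    # Keep only the sign of each rotated coordinate: 1 bit per dimension.
    return np.packbits((base @ rotation) > 0, axis=1)

def coarse_scores(codes, query, rotation, dim):
    # Stage 1 (AI Cores in the article): score every vector from its 1-bit code.
    signs = np.unpackbits(codes, axis=1, count=dim).astype(np.float32) * 2.0 - 1.0
    return signs @ (query @ rotation)

def select_candidates(scores, n_candidates):
    # Stage 2 (AI CPU in the article): unordered top-K by coarse score.
    return np.argpartition(-scores, n_candidates - 1)[:n_candidates]

def rerank(base, query, candidates, k):
    # Stage 3 (host CPU in the article): exact squared L2 distance
    # on the full-precision vectors of the surviving candidates.
    dists = np.sum((base[candidates] - query) ** 2, axis=1)
    return candidates[np.argsort(dists)[:k]]

def search(base, codes, rotation, query, k=10, n_candidates=100):
    scores = coarse_scores(codes, query, rotation, base.shape[1])
    candidates = select_candidates(scores, n_candidates)
    return rerank(base, query, candidates, k)
```

In this sketch the coarse stage touches only the packed codes (8 bytes per 64-dimensional vector), while the full-precision vectors are read only for the `n_candidates` survivors, which is the memory and compute split the pipeline relies on.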
The team introduced four NPU-native optimizations: fused AI Core + AI Vector operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing across queries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask top-K latency. On standard datasets, Ascend-RaBitQ delivered 3.0x to 62.8x faster index construction than CPU baselines, up to 4.6x higher throughput than the fastest CPU IVF-RaBitQ implementation, and over 100x throughput versus the mathematically equivalent CPU baseline. The system also scales to distributed multi-NPU setups, making it suitable for real-world billion-scale AI retrieval tasks.
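As a rough illustration of why block-level balancing helps, the sketch below splits the probed clusters into fixed-size blocks and assigns them greedily to the least-loaded worker. The article does not describe the actual scheduling policy, so the greedy heuristic, the block size, and every name here are assumptions.

```python
import heapq

def split_into_blocks(cluster_sizes, block_size):
    # Cut each probed cluster into blocks of at most block_size vectors.
    blocks = []
    for cluster_id, size in enumerate(cluster_sizes):
        for start in range(0, size, block_size):
            blocks.append((cluster_id, start, min(block_size, size - start)))
    return blocks

def assign_blocks(blocks, n_workers):
    # Greedy assignment: largest block first, always to the least-loaded worker.
    heap = [(0, worker) for worker in range(n_workers)]
    assignment = [[] for _ in range(n_workers)]
    for block in sorted(blocks, key=lambda b: -b[2]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(block)
        heapq.heappush(heap, (load + block[2], worker))
    loads = [sum(b[2] for b in worker_blocks) for worker_blocks in assignment]
    return assignment, loads
```

With one cluster of 1,000 vectors and three clusters of 10, assigning whole clusters to four workers leaves one worker with 1,000 vectors to scan. Splitting into blocks of 100 first brings the heaviest worker down to 300.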
- First heterogeneous NPU-CPU system for billion-scale vector search using 1-bit quantization (RaBitQ).
- Achieves 3x–62.8x faster index construction and up to 4.6x throughput vs. fastest CPU IVF-RaBitQ.
- Optimizations include fused AI Core + AI Vector operators, block-level load balancing across queries, and pipeline parallelism between AI Core and AI CPU.
Why It Matters
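Makes billion-scale vector search more practical for real-time AI systems: 1-bit quantization shrinks the memory footprint, and NPU offload delivers over 100x the throughput of a mathematically equivalent CPU baseline.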