Huawei's Ascend-RaBitQ speeds up billion-scale vector search 100x

arXiv cs.IR May 18, 2026

⚡ Combines NPU coarse ranking with CPU fine ranking for up to 4.6x higher throughput than the fastest CPU IVF-RaBitQ.

Deep Dive

Huawei researchers have published Ascend-RaBitQ, a heterogeneous NPU-CPU system that accelerates billion-scale vector similarity search, reaching over 100x the throughput of a mathematically equivalent CPU baseline. The key innovation is a three-stage pipeline: coarse ranking on 1-bit quantized vectors runs on the NPU's AI Cores, which offer high compute density; top-K selection happens on the on-device AI CPU; and fine re-ranking with full-precision vectors occurs on the host CPU. This decoupling lets each stage run on the hardware best suited to it, easing the long-standing trade-off between accuracy, memory footprint, and performance.
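The three-stage flow described above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the helper names (make_unit_vectors, sign_code, coarse_score, search) are hypothetical, the coarse score is a bare sign-bit inner product rather than the actual RaBitQ estimator, and all three stages run on one CPU here instead of on AI Cores, the AI CPU, and the host.

```python
import math
import random


def make_unit_vectors(n, dim, seed=0):
    """Generate n random unit-length vectors (toy stand-in for a real dataset)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def sign_code(vec):
    """1-bit code: keep only the sign of each coordinate (True = non-negative)."""
    return [x >= 0.0 for x in vec]


def coarse_score(query, code):
    """Approximate inner product, treating each coded coordinate as +1 or -1."""
    return sum(q if bit else -q for q, bit in zip(query, code))


def squared_l2(a, b):
    """Exact squared Euclidean distance on full-precision vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query, vectors, codes, n_candidates, k):
    # Stage 1 (AI Cores in the article): score every 1-bit code.
    scores = [coarse_score(query, c) for c in codes]
    # Stage 2 (on-device AI CPU in the article): keep the best-scoring candidates.
    order = sorted(range(len(codes)), key=lambda i: scores[i], reverse=True)
    candidates = order[:n_candidates]
    # Stage 3 (host CPU in the article): re-rank candidates at full precision.
    candidates.sort(key=lambda i: squared_l2(query, vectors[i]))
    return candidates[:k]


vectors = make_unit_vectors(200, 32)
codes = [sign_code(v) for v in vectors]
top5 = search(vectors[7], vectors, codes, n_candidates=20, k=5)
```

Only the short candidate list ever touches the full-precision vectors, which is why the coarse stage can work from a much smaller 1-bit representation. A larger n_candidates trades extra re-ranking work for better recall.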

The team introduced four NPU-native optimizations: fused AI Core + AI Vector operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing across queries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask top-K latency. On standard datasets, Ascend-RaBitQ delivered 3.0x to 62.8x faster index construction than CPU baselines, up to 4.6x higher throughput than the fastest CPU IVF-RaBitQ implementation, and over 100x throughput versus the mathematically equivalent CPU baseline. The system also scales to distributed multi-NPU setups, making it suitable for real-world billion-scale AI retrieval tasks.
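To give a feel for block-level load balancing across queries, here is a minimal sketch. The article does not describe the scheduling algorithm, so this swaps in a generic greedy "largest cost first, least-loaded core next" heuristic; the function name balance_blocks, the (query_id, block_id, cost) tuples, and the cost values are all assumptions made for illustration.

```python
import heapq


def balance_blocks(work_items, n_cores):
    """Greedily spread (query_id, block_id, cost) items over n_cores.

    Items are taken in descending cost order and each one goes to the
    core with the smallest load so far. Returns (assignments, loads).
    """
    assignments = [[] for _ in range(n_cores)]
    heap = [(0, core) for core in range(n_cores)]  # (current load, core id)
    heapq.heapify(heap)
    for query_id, block_id, cost in sorted(
        work_items, key=lambda item: item[2], reverse=True
    ):
        load, core = heapq.heappop(heap)
        assignments[core].append((query_id, block_id))
        heapq.heappush(heap, (load + cost, core))
    loads = [0] * n_cores
    for load, core in heap:
        loads[core] = load
    return assignments, loads


# Hypothetical workload: three queries probing index blocks of different sizes.
items = [(0, 0, 8), (0, 1, 7), (1, 0, 6), (1, 2, 5), (2, 1, 4)]
assignments, loads = balance_blocks(items, n_cores=2)
```

Scheduling at the granularity of (query, block) pairs rather than whole queries means a query that probes large blocks does not pin a single core while the others sit idle.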

Key Points

First heterogeneous NPU-CPU system for billion-scale vector search using 1-bit quantization (RaBitQ).
Achieves 3x–62.8x faster index construction and up to 4.6x throughput vs. fastest CPU IVF-RaBitQ.
Optimizations include fused AI Core operators, load balancing across queries, and pipeline parallelism.

Why It Matters

Makes billion-scale vector search more practical for real-time AI systems: 1-bit quantization cuts the memory footprint, and throughput exceeds 100x that of the mathematically equivalent CPU baseline.
