Research & Papers

Huawei's Ascend-RaBitQ speeds up billion-scale vector search 100x

Combines NPU coarse ranking with CPU fine ranking for up to 4.6x higher throughput than the fastest CPU IVF-RaBitQ implementation.

Deep Dive

Huawei researchers have published Ascend-RaBitQ, a heterogeneous NPU-CPU system that accelerates billion-scale vector similarity search by more than 100x over a mathematically equivalent CPU baseline. The key innovation is a three-stage pipeline: coarse ranking on 1-bit quantized vectors runs on the NPU's AI Cores, which offer high compute density; top-K selection happens on the on-device AI CPU; and fine re-ranking with full-precision vectors runs on the host CPU. This decoupling lets each stage run on the hardware best suited to it, breaking the long-standing trade-off between accuracy, memory footprint, and performance.
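To make the three stages concrete, here is a minimal single-machine sketch in NumPy. It is not the Ascend-RaBitQ implementation: everything runs on the CPU, there is no IVF partitioning, the vectors are assumed to be unit-normalized, and the coarse score is a simple sign-code inner product rather than RaBitQ's actual distance estimator. All function names are hypothetical.

```python
import numpy as np


def random_rotation(dim, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def encode_1bit(vectors, rotation):
    """Rotate, then keep one sign bit per dimension (packed 8 per byte)."""
    return np.packbits((vectors @ rotation.T) > 0, axis=1)


def coarse_rank(codes, query, rotation):
    """Stage 1 (NPU AI Cores in the article): score every 1-bit code."""
    dim = query.shape[0]
    # Unpack bits to +1/-1 signs; a real system would stay in the bit domain.
    signs = np.unpackbits(codes, axis=1, count=dim).astype(np.float32) * 2.0 - 1.0
    return signs @ (rotation @ query)  # higher means more similar


def select_top_k(scores, k):
    """Stage 2 (on-device AI CPU in the article): unordered top-k indices."""
    return np.argpartition(-scores, k - 1)[:k]


def fine_rerank(vectors, query, candidates, k):
    """Stage 3 (host CPU in the article): exact distances on full precision."""
    dists = np.sum((vectors[candidates] - query) ** 2, axis=1)
    return candidates[np.argsort(dists)[:k]]


def search(vectors, codes, rotation, query, k=5, n_candidates=100):
    """Chain the three stages: cheap scan, candidate cut, exact re-rank."""
    scores = coarse_rank(codes, query, rotation)
    candidates = select_top_k(scores, n_candidates)
    return fine_rerank(vectors, query, candidates, k)
```

The 1-bit codes are 32x smaller than float32 vectors, so the coarse scan touches far less memory. Only the `n_candidates` survivors are read at full precision, which is why the final stage can stay on the host CPU.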

The team introduced four NPU-native optimizations: fused AI Core + AI Vector operators for parallel distance computation, computation flow restructuring to exploit rotation orthogonality, fine-grained index block-level load balancing across queries, and intra-NPU pipeline parallelism between AI Core and AI CPU to mask top-K latency. On standard datasets, Ascend-RaBitQ built indexes 3.0x to 62.8x faster than CPU baselines, delivered up to 4.6x higher throughput than the fastest CPU IVF-RaBitQ implementation, and reached more than 100x the throughput of the mathematically equivalent CPU baseline. The system also scales to distributed multi-NPU setups, making it suitable for real-world billion-scale AI retrieval tasks.
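The summary does not spell out how the computation flow is restructured, but the property it relies on is easy to check: an orthogonal rotation preserves distances and inner products, so those quantities can be computed on either side of the rotation, wherever it is cheaper. The sketch below only verifies that invariance numerically; the function names are hypothetical and it makes no claim about the paper's actual operator design.

```python
import numpy as np


def random_orthogonal(dim, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def rotation_preserves_geometry(dim=32, n=100, seed=0):
    """Compare distances and inner products before and after rotation."""
    rng = np.random.default_rng(seed)
    P = random_orthogonal(dim, rng)
    x = rng.standard_normal((n, dim))  # stand-in database vectors
    q = rng.standard_normal(dim)       # stand-in query

    # Squared L2 distances in the original space...
    d_raw = np.sum((x - q) ** 2, axis=1)
    # ...and after rotating both the database and the query.
    d_rot = np.sum((x @ P.T - P @ q) ** 2, axis=1)

    # Inner products, original versus rotated.
    ip_raw = x @ q
    ip_rot = (x @ P.T) @ (P @ q)

    return np.allclose(d_raw, d_rot), np.allclose(ip_raw, ip_rot)
```

Because both checks hold for any orthogonal matrix, terms such as vector norms or centroid distances do not have to be recomputed after rotation.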

Key Points
  • First heterogeneous NPU-CPU system for billion-scale vector search using 1-bit quantization (RaBitQ).
  • Achieves 3.0x to 62.8x faster index construction than CPU baselines and up to 4.6x higher throughput than the fastest CPU IVF-RaBitQ implementation.
  • Optimizations include fused AI Core + AI Vector operators, index block-level load balancing across queries, and intra-NPU pipeline parallelism.

Why It Matters

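Makes billion-scale vector search more practical for real-time AI systems: 1-bit quantization shrinks the memory footprint, and NPU offload delivers more than 100x the throughput of an equivalent CPU baseline.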