NasZip speeds up nearest neighbor search by up to 8.4× with DIMM-based near-data processing
New co-design tackles memory-bound ANNS bottleneck in RAG for LLMs using PCA-guided early exiting.
Retrieval-augmented generation (RAG) relies on fast approximate nearest neighbor search (ANNS), but high-dimensional distance calculations are bottlenecked by memory bandwidth on CPUs and GPUs. Traditional early-exiting (EE) techniques reduce memory accesses by computing distances over only a subset of dimensions, but the partial distance converges too slowly for exits to trigger early. NasZip, by Cheng Zou and colleagues from multiple institutions, introduces a hardware-software co-design that combines DIMM-based near-data processing (NDP) with a novel PCA-guided early-exiting mechanism. Instead of relying solely on partial distances, it uses estimation and correction parameters to approximate full-dimensional distances, enabling earlier exits without accuracy loss. A bit-level, NDP-aware dynamic-float scheme further cuts memory access for vector data.
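To make the early-exiting idea concrete, here is a minimal Python sketch of the classical partial-distance exit, run once on raw vectors and once after a PCA rotation. It is an illustration under stated assumptions, not NasZip's mechanism: the function names, the synthetic data, the chunk size of 16, and the top-10 threshold are all hypothetical, and the paper's estimation and correction parameters are not modeled because this summary does not describe them.

```python
import numpy as np


def early_exit_distance(query, candidate, threshold, chunk=16):
    """Accumulate squared L2 distance chunk by chunk.

    Returns (pruned, dims_read). The candidate is pruned as soon as the
    partial sum exceeds `threshold`, so later dimensions are never read.
    """
    acc = 0.0
    dims = query.shape[0]
    for start in range(0, dims, chunk):
        diff = query[start:start + chunk] - candidate[start:start + chunk]
        acc += float(diff @ diff)
        if acc > threshold:
            return True, min(start + chunk, dims)
    return False, dims


def fit_pca_rotation(data):
    """Return (mean, basis); basis columns are principal directions,
    ordered by decreasing variance (SVD returns sorted singular values)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt.T


def mean_dims_read(query, vectors, threshold):
    """Average number of dimensions read per candidate before exiting."""
    return float(np.mean(
        [early_exit_distance(query, v, threshold)[1] for v in vectors]
    ))


# Hypothetical synthetic data: variance decays across hidden directions,
# then a random orthogonal mix spreads it over all raw dimensions.
rng = np.random.default_rng(0)
dims, n = 128, 2000
scales = 1.0 / np.sqrt(np.arange(1, dims + 1))
mixing, _ = np.linalg.qr(rng.standard_normal((dims, dims)))
data = (rng.standard_normal((n, dims)) * scales) @ mixing

mean, basis = fit_pca_rotation(data)
rotated = (data - mean) @ basis  # orthogonal rotation: L2 distances unchanged

query_raw, base_raw = data[0], data[1:]
query_rot, base_rot = rotated[0], rotated[1:]

# Assumed exit threshold: midway between the 10th and 11th smallest true
# distances, standing in for the worst entry of a top-10 result list.
true_d2 = np.sort(((base_raw - query_raw) ** 2).sum(axis=1))
threshold = 0.5 * (true_d2[9] + true_d2[10])

raw_dims = mean_dims_read(query_raw, base_raw, threshold)
rot_dims = mean_dims_read(query_rot, base_rot, threshold)
print(f"mean dims read, raw order: {raw_dims:.1f}")
print(f"mean dims read, PCA order: {rot_dims:.1f}")
```

Because the rotation is orthogonal, both runs keep the same ten candidates; what changes is how quickly the partial sum approaches the full distance. Placing high-variance directions first lets most candidates cross the threshold after fewer dimensions, which is the slow-convergence problem the paragraph above describes.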
On the hardware side, NasZip employs a data-aware neighbor list mapping strategy that reduces neighbor retrieval latency and inter-channel communication overhead. This is complemented by a dedicated cache that exploits data locality and improves prefetch efficiency. The co-optimized approach yields speedups of up to 8.4× over CPU baselines and 1.4× over the best GPU implementations at equal accuracy. Compared to ANSMET, a state-of-the-art NDP ANNS accelerator, NasZip achieves a 1.69× performance improvement. These results matter for any system relying on large-scale vector search, particularly RAG pipelines, where retrieval latency adds directly to response time and the retrieved context helps ground LLM outputs and limit hallucinations.
- NasZip achieves up to 8.4× speedup over CPU and 1.4× over GPU at equal accuracy for ANNS.
- Combines near-data processing (NDP) with PCA-guided early exiting and a bit-level dynamic-float scheme to reduce memory bandwidth pressure.
- Outperforms the state-of-the-art NDP ANNS accelerator ANSMET by 1.69×.
Why It Matters
Faster, more memory-efficient vector search speeds up RAG pipelines, whose retrieved context helps curb LLM hallucinations, and makes real-time AI applications more practical.