NasZip speeds up nearest neighbor search by up to 8.4× with DIMM-based near-data processing

arXiv cs.DC May 22, 2026

⚡ New co-design tackles the memory-bound ANNS bottleneck in RAG for LLMs using PCA-guided early exiting.

Deep Dive

Retrieval-augmented generation (RAG) relies on fast approximate nearest neighbor search (ANNS), but on CPUs and GPUs the high-dimensional distance calculations at its core are bottlenecked by memory bandwidth. Traditional early-exiting (EE) techniques reduce memory accesses by computing distances over only a subset of dimensions, but their partial distances converge slowly, which limits how early a candidate can be rejected. NasZip, by Cheng Zou and colleagues from multiple institutions, is a hardware-software co-design that combines DIMM-based near-data processing (NDP) with a novel PCA-guided early-exiting mechanism. Instead of relying solely on partial distances, it uses estimation and correction parameters to approximate full-dimensional distances, enabling earlier exits without accuracy loss. A bit-level, NDP-aware dynamic-float scheme further cuts the memory traffic needed to read vector data.
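To make the early-exiting idea concrete, here is a minimal sketch of the classic partial-distance baseline that the summary contrasts NasZip with. It is not NasZip's estimator: the function names (`early_exit_distance`, `nearest`), the chunk size `step`, and the synthetic data are all hypothetical, and the vectors are simply assumed to be already PCA-rotated so that the leading dimensions carry most of the variance.

```python
import random


def early_exit_distance(query, candidate, threshold, step=4):
    """Accumulate squared L2 distance in chunks of `step` dimensions.

    Returns (distance, dims_read). distance is None when the running
    partial sum exceeded `threshold` and the candidate was abandoned.
    """
    partial = 0.0
    dims = len(query)
    for start in range(0, dims, step):
        end = min(start + step, dims)
        for q, c in zip(query[start:end], candidate[start:end]):
            diff = q - c
            partial += diff * diff
        # The partial sum only grows, so it is a lower bound on the
        # full distance: once it passes the threshold, stop reading.
        if partial > threshold:
            return None, end
    return partial, dims


def nearest(query, candidates, step=4):
    """Linear scan that uses the best distance so far as the threshold."""
    best_idx, best_dist, dims_read = -1, float("inf"), 0
    for idx, cand in enumerate(candidates):
        dist, read = early_exit_distance(query, cand, best_dist, step)
        dims_read += read
        if dist is not None:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, dims_read


# Synthetic stand-in for PCA-rotated vectors (assumption): the standard
# deviation shrinks with the dimension index, so early dimensions dominate.
random.seed(0)
DIMS = 32


def make_vector():
    return [random.gauss(0.0, 1.0 / (d + 1)) for d in range(DIMS)]


query = make_vector()
candidates = [make_vector() for _ in range(200)]
best_idx, best_dist, dims_read = nearest(query, candidates)
print(f"nearest={best_idx}, dims read={dims_read} of {DIMS * len(candidates)}")
```

This baseline is exact, because a rejected candidate's full distance can only be larger than its partial distance. Per the summary, NasZip goes further by estimating the full-dimensional distance from the partial one using estimation and correction parameters, which this sketch does not attempt to reproduce.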

On the hardware side, NasZip employs a data-aware neighbor list mapping strategy that reduces neighbor retrieval latency and inter-channel communication overhead. This is complemented by a dedicated cache that exploits data locality and improves prefetch efficiency. The co-optimized design yields speedups of up to 8.4× over CPU baselines and 1.4× over the best GPU implementations at equal accuracy. Compared to ANSMET, a state-of-the-art NDP ANNS accelerator, NasZip achieves a 1.69× performance improvement. These results matter for any system that relies on large-scale vector search, particularly RAG pipelines, where retrieval grounds LLM outputs against hallucination and its latency adds to every response.

Key Points

NasZip achieves up to 8.4× speedup over CPU and 1.4× over GPU at equal accuracy for ANNS.
Combines near-data processing (NDP) with PCA-based early-exiting and bit-level dynamic-float to reduce memory bandwidth pressure.
Outperforms the state-of-the-art NDP ANNS accelerator ANSMET by 1.69×.

Why It Matters

Faster, memory-efficient vector search directly accelerates RAG pipelines, which ground LLM outputs to curb hallucinations, and brings real-time AI applications within reach.
