A Voronoi Cell Formulation for Principled Token Pruning in Late-Interaction Retrieval Models
New geometric framework reduces dense retrieval storage overhead while maintaining search accuracy.
A research team including Yash Kankanampati, Yuxuan Zong, Nadi Tomeh, Benjamin Piwowarski, and Joseph Le Roux has published a paper introducing a novel geometric framework for token pruning in late-interaction retrieval models. The work addresses a critical bottleneck in models like ColBERT, which require storing dense embeddings for every document token, creating substantial storage overhead. Previous pruning methods often lacked formal grounding or proved ineffective, but this new approach casts the problem as estimating Voronoi cells in the high-dimensional embedding space.
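To see where the storage overhead comes from: late-interaction models like ColBERT score a query against a document by letting each query token take its maximum similarity over all document token embeddings, then summing. A minimal sketch of this MaxSim scoring (all names and shapes here are illustrative, not the paper's code) makes clear why every document token's embedding must be kept in the index:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT-style late-interaction score: each query token takes the
    max similarity over all document token embeddings, then sum."""
    # (num_query_tokens, num_doc_tokens) similarity matrix
    sims = query_emb @ doc_emb.T
    return sims.max(axis=1).sum()

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))    # 8 query-token embeddings, dim 128
d = rng.normal(size=(300, 128))  # 300 document-token embeddings
score = maxsim_score(q, d)
# Because the max runs over every document token, the index must store
# an embedding per token: size grows with total token count, not
# document count. Pruning tokens shrinks d and hence the index.
```

This is why principled token pruning pays off directly: every row removed from `d` is one fewer embedding stored and scored.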
By interpreting each token's influence as the measure of its Voronoi region (the portion of embedding space closer to that token than to any other), the researchers derive a principled criterion for which tokens can be safely removed. Their experiments demonstrate that this geometric approach maintains retrieval quality while significantly reducing index size. The framework not only serves as a competitive pruning strategy but also provides insight into token-level behavior within dense retrieval systems, offering both practical efficiency gains and improved interpretability for developers working on search and RAG applications.
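The core geometric idea can be sketched with a simple Monte Carlo estimate. This is an illustration of Voronoi cell measure estimation in general, not the paper's actual estimator: sample probe points from some assumed query-embedding distribution, assign each to its nearest document-token embedding, and treat the fraction of probes a token captures as the measure of its Voronoi cell. Tokens whose cells capture almost no mass are natural pruning candidates. All names and the probe distribution below are hypothetical:

```python
import numpy as np

def voronoi_cell_measures(token_embs, samples):
    """Estimate the relative measure of each token's Voronoi cell by
    assigning sampled points to their nearest token embedding."""
    # (num_samples, num_tokens) squared Euclidean distances
    d2 = ((samples[:, None, :] - token_embs[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)           # index of closest token per sample
    counts = np.bincount(nearest, minlength=len(token_embs))
    return counts / len(samples)          # fractions sum to 1

rng = np.random.default_rng(0)
tokens = rng.normal(size=(50, 32))    # document-token embeddings
probes = rng.normal(size=(5000, 32))  # stand-in samples for query embeddings
measures = voronoi_cell_measures(tokens, probes)
# Tokens whose cells capture almost no mass contribute little to any
# MaxSim score and are candidates for removal from the index.
prune_mask = measures < 0.5 / len(tokens)
```

The design choice to weight tokens by cell measure, rather than by embedding norm or attention weight, is what gives the method its formal grounding: a token only matters to late-interaction scoring insofar as query embeddings actually land nearest to it.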
- Novel geometric framework treats token pruning as Voronoi cell estimation in embedding space
- Addresses storage overhead in ColBERT-style models requiring dense embeddings per token
- Maintains retrieval quality while reducing index size through principled token removal
Why It Matters
Enables more efficient deployment of advanced retrieval models for enterprise search and RAG systems at scale.