Layer-wise Token Compression Boosts Reranker Speed by 116%
New token compression method speeds up document reranking without losing quality.
Deep Dive
Researchers propose Layer-wise Token Compression (LTC) for cross-encoder rerankers. By adaptively pooling tokens at intermediate transformer layers, LTC delivers queries-per-second (QPS) gains of up to 25% on MS MARCO passage ranking and 116% on document ranking. The method also extends to listwise LLM rerankers. Surprisingly, compressed models outperform uncompressed ones on long-document tasks, which suggests that compression acts as a regularizer.
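To put the throughput figures in per-query terms, the short sketch below converts a relative QPS gain into the implied reduction in average time per query. It assumes the same hardware and the same level of concurrency before and after compression; the function name `time_per_query_reduction` is illustrative and does not come from the paper.

```python
def time_per_query_reduction(qps_gain_pct: float) -> float:
    """Fractional drop in average time per query implied by a relative QPS gain.

    Assumes identical hardware and concurrency before and after the change.
    """
    speedup = 1.0 + qps_gain_pct / 100.0  # a 116% gain means 2.16x throughput
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    for gain in (25, 116):
        saved = time_per_query_reduction(gain)
        print(f"{gain}% QPS gain -> {saved:.1%} less time per query")
```

Under these assumptions, a 25% QPS gain corresponds to 20% less time per query, and a 116% gain to roughly 54% less.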
Key Points
- LTC applies adaptive token pooling at intermediate transformer layers, not just the input layer.
- On MS MARCO, LTC boosts queries per second (QPS) by up to 25% for passage ranking and 116% for document ranking.
- Compressed models outperform uncompressed ones on long-document tasks, suggesting that compression acts as a regularizer.
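The idea of pooling tokens partway through the layer stack can be sketched roughly as follows. This is not the authors' implementation: the summary does not describe how LTC chooses what to pool, so fixed-window mean pooling stands in for the adaptive step, identity functions stand in for transformer layers, and the names `mean_pool_tokens`, `run_layers`, `pool_after`, `window`, and `keep_first` are all hypothetical.

```python
from typing import Callable, List, Set

Vector = List[float]
Layer = Callable[[List[Vector]], List[Vector]]


def mean_pool_tokens(hidden: List[Vector], window: int, keep_first: int = 1) -> List[Vector]:
    """Shorten a token sequence by averaging non-overlapping windows.

    The first `keep_first` positions (for example a classification token)
    pass through unchanged. Fixed windows are an assumption standing in
    for LTC's adaptive pooling, which the summary does not specify.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    kept = [list(vec) for vec in hidden[:keep_first]]
    rest = hidden[keep_first:]
    pooled: List[Vector] = []
    for start in range(0, len(rest), window):
        chunk = rest[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return kept + pooled


def run_layers(hidden: List[Vector], layers: List[Layer],
               pool_after: Set[int], window: int) -> List[Vector]:
    """Apply layers in order, compressing the sequence after selected layers."""
    for index, layer in enumerate(layers):
        hidden = layer(hidden)
        if index in pool_after:
            hidden = mean_pool_tokens(hidden, window)
    return hidden


if __name__ == "__main__":
    # Nine 2-dimensional token vectors; identity layers stand in for real ones.
    tokens = [[float(i), float(i)] for i in range(9)]
    identity_layers: List[Layer] = [lambda h: h] * 4
    out = run_layers(tokens, identity_layers, pool_after={1, 3}, window=2)
    print(len(tokens), "->", len(out), "tokens")
```

Because later layers see a shorter sequence, their attention and feed-forward cost shrinks, which is the intuition behind the reported throughput gains; the layers before the first pooling step still operate on the full input.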
Why It Matters
Faster reranking without quality loss means cheaper and more scalable search in production systems.