Research & Papers

Latent Terms: Dense Retrievers hide BM25-ready Zipfian vocabularies

Extraction with no retrieval-specific training turns dense retrievers into strong sparse search engines.

Deep Dive

A new paper from Clavié, Lee, Shakir, and Kato reports that dense retrieval models, both single- and multi-vector, contain latent vocabularies that can be readily extracted for classical sparse retrieval. The method, called Latent Terms, applies sparse autoencoders to a frozen retriever without any retrieval-specific adjustments. The extracted vocabulary follows approximately Zipfian collection statistics, which makes it directly suitable for scoring with BM25, the classical sparse ranking function. The approach requires no learned expansion objective and no sparse retrieval supervision, so it can be applied to an existing dense retriever without modifying the retriever itself.
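As a rough illustration of the pipeline described above, the sketch below treats the strongest sparse-autoencoder features of each vector as latent "term" IDs and scores them with a standard BM25 formula. It is a minimal sketch under stated assumptions, not the authors' implementation: the top-k selection rule, the way term frequencies are built from per-token activations, the function names top_k_terms, to_bag, and bm25_scores, and the hardcoded toy activations are all hypothetical, since the summary does not describe those details.

```python
import math
from collections import Counter

def top_k_terms(activations, k):
    """Indices of the k largest positive activations, used as latent term IDs.
    (Assumed selection rule; the summary does not say how terms are chosen.)"""
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return [i for i in ranked[:k] if activations[i] > 0]

def to_bag(token_activations, k=2):
    """Flatten per-token latent terms into one bag of term IDs for a text."""
    return [t for acts in token_activations for t in top_k_terms(acts, k)]

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of term IDs) against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy stand-ins for sparse-autoencoder activations over a frozen retriever's
# vectors: one row per token, four latent features. Purely illustrative values.
doc_activations = [
    [[0.9, 0.0, 0.3, 0.0], [0.8, 0.1, 0.0, 0.0]],  # document 0
    [[0.0, 0.7, 0.0, 0.2], [0.0, 0.6, 0.5, 0.0]],  # document 1
]
query_activations = [[0.7, 0.0, 0.0, 0.0]]

docs = [to_bag(acts) for acts in doc_activations]
query = to_bag(query_activations)
scores = bm25_scores(query, docs)
print(scores)  # document 0 shares latent term 0 with the query and ranks first
```

Because the latent term IDs behave like ordinary vocabulary entries, the same bags could in principle be loaded into any existing inverted-index engine rather than scored in memory as done here.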

Latent Terms achieves performance comparable to or better than the single-vector scoring methods of its base model and also outperforms comparable SPLADE variants, which are explicitly trained for sparse retrieval. Notably, it substantially outperforms its base model on the LIMIT task, a benchmark designed to highlight the failures of single-vector retrieval. The findings suggest that neural retrievers encode far more expressive and indexable structure than their default scoring functions expose, and that alternative extraction methods can unlock that hidden capability for hybrid or purely sparse search pipelines.

Key Points
  • Sparse autoencoders on frozen dense retrievers extract a Zipfian-distributed latent vocabulary ready for BM25 scoring.
  • No learned expansion or sparse supervision is needed; the method is applied to an existing dense retriever as is.
  • Matches or beats the base model's single-vector scoring and comparable SPLADE variants, and substantially outperforms the base model on LIMIT, a benchmark built to expose single-vector failures.
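
One hedged way to sanity-check the "approximately Zipfian" claim on an extracted vocabulary is to fit a line to the log rank versus log frequency curve and see whether the slope is close to -1. The sketch below assumes the latent terms are already available as bags of integer IDs; the function name zipf_slope and the synthetic corpus are illustrative only and do not come from the paper.

```python
import math
from collections import Counter

def zipf_slope(term_bags):
    """Least-squares slope of log(frequency) against log(rank).
    A slope near -1 is consistent with a Zipfian distribution.
    Requires at least two distinct terms."""
    counts = Counter(t for bag in term_bags for t in bag)
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic corpus: term r appears about 1000 / r times, an idealized Zipf law.
synthetic_bag = [r for r in range(1, 51) for _ in range(round(1000 / r))]
slope = zipf_slope([synthetic_bag])
print(round(slope, 2))
```

On real latent-term bags the fit will be noisier than on this idealized corpus, so the slope is best read as a rough diagnostic rather than a pass/fail test.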

Why It Matters

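Latent Terms opens a path to hybrid or purely sparse retrieval without retraining the retriever, pairing what a dense model has already learned with the efficiency of a sparse index and improving robustness on cases where single-vector scoring fails.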