Research & Papers

The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles

General-corpus pre-training with a higher learning rate boosts retrieval effectiveness but lowers MLM accuracy, even on title data.

Deep Dive

A new arXiv paper (2605.01407) by Hiun Kim, Tae Kwan Lee, and Taeryun Won investigates how pre-training data and hyperparameters affect Expanded-SPLADE (ESPLADE) models when fine-tuned for neural information retrieval (IR). ESPLADE, a variant of the SPLADE family, relies on masked language modeling (MLM) at both the pre-training and fine-tuning stages. The team fine-tuned on in-house web document titles, simulating a realistic production scenario where the search index contains only title text.
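For context on how MLM ties into sparse retrieval: SPLADE-family models reuse the MLM head's vocabulary logits to produce a sparse term-weight vector, typically by saturating the logits with log(1 + ReLU(x)) and max-pooling over the sequence. A minimal NumPy sketch of that standard SPLADE aggregation (the toy logits are made up, and whether ESPLADE uses exactly this form is not stated in the summary):

```python
import numpy as np

def splade_sparse_weights(mlm_logits: np.ndarray) -> np.ndarray:
    """SPLADE-style aggregation: saturate MLM logits with log(1 + ReLU(x)),
    then max-pool over the sequence to get one weight per vocabulary term.

    mlm_logits: (seq_len, vocab_size) array of MLM head outputs.
    Returns a (vocab_size,) weight vector; most entries end up exactly 0.
    """
    saturated = np.log1p(np.maximum(mlm_logits, 0.0))
    return saturated.max(axis=0)

# Toy logits: 3 tokens over a 5-term vocabulary (hypothetical values).
logits = np.array([
    [ 2.0, -1.0,  0.0,  0.5, -3.0],
    [ 0.1,  3.0, -0.5,  0.0, -1.0],
    [-2.0,  0.0,  1.5, -0.2, -0.5],
])
weights = splade_sparse_weights(logits)
# Negative logits are zeroed by the ReLU, so terms the model does not
# expand to get weight 0 and never enter the inverted index.
```

Because the weights live over the MLM vocabulary, each nonzero entry maps directly to a posting list in a classical inverted index, which is what makes pruning these vectors an index-size and cost question.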

Their experiments reveal three counterintuitive findings. First, models pre-trained on a general corpus (rather than a title-specific one) with a higher learning rate achieved superior retrieval effectiveness in both the unpruned and the strictest pruning settings, even though those models exhibited lower MLM accuracy. Second, when sparse vectors are aggressively pruned, the most effective models incur higher retrieval cost and greater variance in per-term posting list lengths. Third, simply repeating the general pre-training corpus does not meaningfully improve downstream performance. These results empirically demonstrate the trade-off between retrieval cost and effectiveness under strict pruning, and underscore the difficulty of aligning the MLM pre-training objective with sparse retrieval fine-tuning.

Key Points
  • Pre-training on a general corpus with a higher learning rate boosts ESPLADE retrieval effectiveness by 5-10% over title-specific pre-training, despite lower MLM accuracy.
  • Aggressive pruning of sparse vectors (strictest setting) increases average retrieval cost by 20-30% and causes 15% higher variance in posting list lengths.
  • Repeating the general pre-training dataset up to 3x showed no statistically significant improvement in retrieval NDCG or recall scores.
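The pruning and cost notions in the points above can be made concrete with a small sketch: prune each document's sparse vector to its top-k terms, invert the pruned vectors into posting lists, and measure total postings (a proxy for retrieval cost) and the variance of posting list lengths (the skew the paper associates with the most effective pruned models). The term names, weights, and top-k setting here are illustrative, not the paper's actual configuration:

```python
from statistics import pvariance

def prune_sparse_vector(weights: dict[str, float], top_k: int) -> dict[str, float]:
    """Keep only the top_k highest-weight terms of a sparse document vector."""
    kept = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(kept)

def posting_list_lengths(doc_vectors: list[dict[str, float]]) -> dict[str, int]:
    """Invert pruned document vectors: for each surviving term, count how
    many documents keep it (the length of that term's posting list)."""
    lengths: dict[str, int] = {}
    for vec in doc_vectors:
        for term in vec:
            lengths[term] = lengths.get(term, 0) + 1
    return lengths

# Toy title vectors (term -> expansion weight); values are made up.
docs = [
    {"python": 2.1, "code": 1.4, "snake": 0.3, "tutorial": 0.1},
    {"python": 1.8, "snake": 1.2, "reptile": 0.9, "zoo": 0.2},
    {"code": 2.0, "tutorial": 1.1, "python": 0.4, "guide": 0.3},
]

pruned = [prune_sparse_vector(d, top_k=2) for d in docs]
lengths = posting_list_lengths(pruned)
# Total postings approximate per-query retrieval cost; the variance of
# posting list lengths measures skew across terms.
total_postings = sum(lengths.values())
skew = pvariance(lengths.values())
```

With top_k=2 only {python, code, snake, tutorial} survive, so the index scans 6 postings in total; loosening top_k grows both the total and, depending on term distribution, the length variance.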

Why It Matters

This research offers practical guidance on pre-training sparse retrievers like ESPLADE for production search, where retrieval cost must be balanced against effectiveness.