UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval
Uncertainty-based sampling cuts training data by 50% while improving retrieval accuracy.
Researchers from Seoul National University have introduced UnIte (Uncertainty-based Iterative Document Sampling), a novel method for unsupervised domain adaptation in neural information retrieval. The approach tackles the challenge of adapting retrievers to unseen domains without labeled data by intelligently selecting which documents to use for pseudo query generation. Unlike existing diversity-focused sampling methods, UnIte leverages model uncertainty to filter out noisy data (aleatoric uncertainty) and prioritize informative samples (epistemic uncertainty), maximizing the learning utility per document.
In extensive experiments on the BEIR benchmark, UnIte achieved significant gains of +2.45 nDCG@10 with small models and +3.49 with large models, while using only 4,000 training samples on average—a substantial reduction in data requirements. The method was accepted at ACL 2026 (Findings) and demonstrates that uncertainty-aware sampling can outperform diversity-based approaches for domain adaptation. This work has practical implications for deploying neural retrievers in new domains where labeled data is scarce, potentially enabling faster and more cost-effective adaptation for enterprise search, legal document retrieval, and scientific literature mining.
- UnIte filters documents with high aleatoric uncertainty and prioritizes those with high epistemic uncertainty for pseudo query generation
- Achieves +2.45 and +3.49 nDCG@10 improvements on BEIR benchmark with only 4k training samples on average
- Accepted at ACL 2026 (Findings) and outperforms existing diversity-focused document sampling methods
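The selection scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes relevance probabilities from several stochastic forward passes (e.g., MC dropout) and uses a BALD-style decomposition, where mean per-pass entropy approximates aleatoric uncertainty and the remainder of the total entropy approximates epistemic uncertainty. The function name, threshold, and decomposition choice are all hypothetical.

```python
import numpy as np

def entropy(p):
    # Binary entropy of relevance probability p (clipped for numerical stability).
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def select_documents(mc_probs, aleatoric_cut=0.6, n_select=2):
    """Pick documents for pseudo-query generation (illustrative sketch).

    mc_probs: array of shape (n_docs, n_passes) holding relevance
    probabilities from stochastic forward passes of the retriever.

    Decomposition (BALD-style, an assumption of this sketch):
      total H[mean prediction] = epistemic + aleatoric,
      aleatoric ~ mean per-pass entropy (irreducible data noise),
      epistemic ~ disagreement between passes (model uncertainty).
    """
    aleatoric = entropy(mc_probs).mean(axis=1)          # noise in the data
    total = entropy(mc_probs.mean(axis=1))              # entropy of the mean
    epistemic = total - aleatoric                       # what the model could learn

    # 1) Filter out noisy documents: high aleatoric uncertainty.
    keep = np.where(aleatoric <= np.quantile(aleatoric, aleatoric_cut))[0]
    # 2) Prioritize informative documents: high epistemic uncertainty.
    ranked = keep[np.argsort(-epistemic[keep])]
    return ranked[:n_select]
```

For example, a document whose passes all predict 0.5 is noisy (high aleatoric, near-zero epistemic) and gets filtered, while one whose passes disagree (say, alternating 0.1 and 0.9) is informative and gets prioritized.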
Why It Matters
Enables efficient domain adaptation for neural retrievers with 50% less data, boosting accuracy for enterprise search and IR systems.