Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts
A new benchmark simulates query shifts with 12 perturbation types; HAT-VTR counters the resulting hubness, reducing it by 40%.
A team of researchers from multiple institutions has introduced HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), a framework designed to make video-text retrieval (VTR) models robust to real-world query shifts. Modern VTR models excel on in-distribution benchmarks but fail when query distributions deviate from the training data, a problem the team evaluates systematically with a new benchmark featuring 12 distinct video perturbation types, each at 5 severity levels. Their analysis reveals that query shifts amplify the hubness phenomenon, in which a few gallery items become dominant 'hubs' that attract a disproportionate share of queries, crowding out correct matches and causing sharp performance drops.
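Hubness can be quantified with a standard diagnostic that predates this paper: the skewness of the k-occurrence distribution, i.e., how often each gallery item appears in the queries' top-k neighbor lists. Below is a minimal NumPy sketch on toy data; the matrix sizes and the value of k are illustrative assumptions, not from the paper.

```python
import numpy as np

def k_occurrence(sim, k=10):
    """N_k: how often each gallery item appears in the queries' top-k lists."""
    # sim: (num_queries, num_gallery) similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=sim.shape[1])

def hubness_skewness(sim, k=10):
    """Skewness of the N_k distribution; high positive skew means
    a few gallery items dominate the neighbor lists (hubs)."""
    nk = k_occurrence(sim, k).astype(float)
    return ((nk - nk.mean()) ** 3).mean() / (nk.std() ** 3 + 1e-12)

# Toy data: sizes are illustrative, not from the paper.
rng = np.random.default_rng(0)
sim = rng.normal(size=(1000, 512)) @ rng.normal(size=(500, 512)).T
print(f"N_k skewness: {hubness_skewness(sim):.2f}")
```

Even i.i.d. Gaussian embeddings show positive skew under dot-product similarity, since large-norm gallery items attract many queries; the paper's finding is that query shifts push real embeddings further in this direction.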
To counter this, HAT-VTR combines two key components: a Hubness Suppression Memory that refines similarity scores to curb hub dominance, and multi-granular losses that enforce temporal feature consistency across video frames. Extensive experiments show that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query-shift scenarios. Accepted at ICLR 2026, the work provides a critical baseline for deploying VTR systems in dynamic environments where query patterns shift unpredictably, such as surveillance, content moderation, or personalized video search.
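The article does not spell out the memory's internals, so the sketch below illustrates a well-known related idea, querybank normalization via inverted softmax from prior VTR work, rather than the paper's actual method: keep a rolling memory of recent query embeddings and renormalize each gallery item's score by how strongly it attracts the whole bank, so hubs receive a large penalty. The class name, bank size, and temperature are assumptions.

```python
import numpy as np
from collections import deque

class HubSuppressionBank:
    """Illustrative memory-based hub suppression (QB-Norm-style inverted
    softmax), NOT the paper's actual Hubness Suppression Memory."""

    def __init__(self, bank_size=256, beta=20.0):
        self.bank = deque(maxlen=bank_size)  # rolling memory of query embeddings
        self.beta = beta                     # softmax inverse temperature

    def refine(self, query, gallery):
        # query: (d,), gallery: (num_gallery, d); both L2-normalized
        self.bank.append(query)
        bank_sim = np.stack(self.bank) @ gallery.T  # (bank, num_gallery)
        raw = gallery @ query                       # raw cosine scores, (num_gallery,)
        # Inverted softmax over the bank: a hub scores highly against many
        # banked queries, accumulating a large normalizer and a lower score.
        shift = bank_sim.max(axis=0)                # per-column max for stability
        norm = np.exp(self.beta * (bank_sim - shift)).sum(axis=0)
        return np.exp(self.beta * (raw - shift)) / norm
```

Calling refine(query, gallery) per incoming query keeps the hub estimates adapting as the query distribution drifts, since old queries fall out of the bounded deque.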
- Benchmark includes 12 perturbation types (e.g., blur, occlusion, temporal jitter) at 5 severity levels to simulate real-world query shifts.
- HAT-VTR uses a Hubness Suppression Memory to refine similarity scores and multi-granular losses for temporal consistency (a toy sketch of such a loss follows this list).
- Accepted at ICLR 2026; outperforms prior methods across all tested query shift scenarios.
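The multi-granular losses are only named in the summary, so the following PyTorch sketch is purely illustrative of how temporal consistency might be enforced at two granularities: a fine-grained term pulls adjacent frame features together, and a coarse term pulls each frame toward the pooled video embedding. Function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(frame_feats, video_feat):
    """Illustrative two-granularity consistency loss (names hypothetical).
    frame_feats: (T, d) per-frame embeddings; video_feat: (d,) pooled embedding."""
    f = F.normalize(frame_feats, dim=-1)
    v = F.normalize(video_feat, dim=-1)
    # Fine granularity: adjacent frames should stay close under perturbation.
    frame_term = (1 - F.cosine_similarity(f[:-1], f[1:], dim=-1)).mean()
    # Coarse granularity: each frame should agree with the video embedding.
    video_term = (1 - f @ v).mean()
    return frame_term + video_term
```

A call like temporal_consistency_loss(frame_feats, frame_feats.mean(dim=0)) yields a scalar term that could be added to a retrieval objective.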
Why It Matters
Makes video retrieval reliable in unpredictable environments like surveillance or content moderation.