Beyond Caption-Based Queries for Video Moment Retrieval
Architectural modifications fix the critical 'decoder-query collapse' issue plaguing current video moment retrieval models.
A research team led by David Pujol-Perich has published a CVPR 2026 paper revealing critical flaws in current Video Moment Retrieval (VMR) systems and proposing a solution. Their work demonstrates that models trained on descriptive captions fail dramatically on real-world search queries, suffering from a 'language gap' caused by linguistic under-specification and a 'multi-moment gap' when queries reference multiple events. The researchers created three modified benchmarks from popular datasets to expose these weaknesses, showing that existing DETR-based architectures degrade significantly outside their caption-style training distribution.
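To make the two gaps concrete, the toy snippet below contrasts the kind of descriptive caption such models are trained on with a terse real-world search query and a multi-moment query. All strings and timestamps here are hypothetical illustrations, not examples from the paper.

```python
# Illustrative only: hypothetical examples of the two query regimes.
# Training captions are long and fully specified; real search queries
# are terse (the 'language gap') and may refer to several disjoint
# moments at once (the 'multi-moment gap').

training_caption = (
    "A man in a red jacket walks onto the stage, picks up a guitar, "
    "and begins playing while the crowd cheers."
)

# Under-specified search query: same intent, far less linguistic detail.
search_query = "guy playing guitar"

# Multi-moment query: one query whose ground truth spans several
# disjoint (start_sec, end_sec) intervals in the same video.
multi_moment_query = "every time the dog barks"
ground_truth_moments = [(4.0, 6.5), (31.2, 33.0), (58.7, 61.4)]
```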
The team identified 'decoder-query collapse' as the core technical problem: the decoder comes to rely on too few active queries, leaving the architecture unable to handle multi-moment queries. Their solution modifies the architecture to increase the number of active decoder queries, yielding improvements of up to 14.82% mAP_m on standard search queries and 21.83% mAP_m specifically on multi-moment queries. This work is a significant step toward practical video search systems that understand natural human queries rather than requiring perfectly descriptive captions, with implications for video editing, content moderation, and media analysis tools.
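For context on the architecture family involved: in DETR-style moment retrieval, a fixed set of learnable decoder queries cross-attends to video features, and each query proposes one temporal span. The PyTorch sketch below is a minimal illustration of that pattern with the query count raised; it paraphrases the general idea of increasing active decoder queries and does not reproduce the paper's actual modifications. All module names and hyperparameters are hypothetical.

```python
# Minimal sketch of a DETR-style moment-retrieval head, assuming
# fused video-text clip features as input. Not the paper's model.
import torch
import torch.nn as nn

class MomentDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=30, num_layers=2):
        super().__init__()
        # More learnable queries -> more candidate moments the decoder
        # can commit to at once, which matters for multi-moment inputs.
        self.moment_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.span_head = nn.Linear(d_model, 2)   # (center, width), normalized
        self.score_head = nn.Linear(d_model, 1)  # per-moment confidence

    def forward(self, video_feats):
        # video_feats: (batch, num_clips, d_model)
        b = video_feats.size(0)
        q = self.moment_queries.weight.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(tgt=q, memory=video_feats)
        spans = self.span_head(h).sigmoid()      # (batch, num_queries, 2)
        scores = self.score_head(h).squeeze(-1)  # (batch, num_queries)
        return spans, scores

feats = torch.randn(1, 75, 256)  # e.g., 75 clip features for one video
spans, scores = MomentDecoder(num_queries=30)(feats)
print(spans.shape, scores.shape)  # torch.Size([1, 30, 2]) torch.Size([1, 30])
```

The intuition this sketch is meant to convey: if only one or two queries remain active, the decoder can only ever commit to one or two spans per video, which matches the failure mode the name 'decoder-query collapse' suggests.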
- Identified 'decoder-query collapse' as the key failure mode of DETR-based architectures on multi-moment video queries
- Created three modified benchmarks from popular datasets that expose severe degradation on real-world search queries
- Architectural modifications improve multi-moment search accuracy by 21.83% mAP_m over existing methods (see the evaluation sketch after this list)
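For readers unfamiliar with moment-retrieval metrics, the sketch below shows the standard temporal-IoU matching that scores like mAP are typically built on. The paper's exact mAP_m protocol (IoU thresholds, averaging scheme) is assumed here, not reproduced, and all values are made up.

```python
# Hedged sketch of multi-moment scoring: a predicted span counts as a
# hit if its temporal IoU with some still-unmatched ground-truth moment
# meets a threshold. Average precision is then computed over the ranked
# hit/miss list (not shown).

def temporal_iou(a, b):
    """IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_predictions(preds, gts, thresh=0.5):
    """Greedily match ranked predictions to ground-truth moments.

    preds: list of (start, end) spans sorted by descending confidence.
    Returns one hit/miss flag per prediction; each GT is used at most once.
    """
    used, hits = set(), []
    for p in preds:
        best_iou, best_j = 0.0, None
        for j, g in enumerate(gts):
            iou = temporal_iou(p, g)
            if j not in used and iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= thresh:
            used.add(best_j)
            hits.append(True)
        else:
            hits.append(False)
    return hits

# Multi-moment example: three GT moments, four ranked predictions.
gts = [(4.0, 6.5), (31.2, 33.0), (58.7, 61.4)]
preds = [(4.2, 6.4), (30.0, 33.5), (10.0, 12.0), (58.0, 61.0)]
print(match_predictions(preds, gts))  # [True, True, False, True]
```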
Why It Matters
Enables practical video search with natural language queries instead of requiring perfectly descriptive captions.