Research & Papers

Beyond Caption-Based Queries for Video Moment Retrieval

New architectural modifications fix the critical 'decoder-query collapse' issue plaguing current video moment retrieval models.

Deep Dive

A research team led by David Pujol-Perich has published a CVPR 2026 paper revealing critical flaws in current Video Moment Retrieval (VMR) systems and proposing a solution. Their work demonstrates that models trained on descriptive captions fail dramatically on real-world search queries, suffering from a 'language gap' caused by linguistic under-specification and a 'multi-moment gap' when a query references multiple events. To expose these weaknesses, the researchers created three modified benchmarks from popular datasets, showing that existing DETR-based architectures degrade significantly outside their training context.
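To make the 'language gap' concrete, here is a minimal sketch of how one might measure the caption-versus-query gap for a VMR model. The annotation format, the `language_gap` helper, and the `retrieve(text) -> Moment` interface are illustrative assumptions, not the authors' benchmark code.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)

def temporal_iou(a: Moment, b: Moment) -> float:
    """Temporal intersection-over-union between two moments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(preds: List[Moment], gts: List[Moment], thresh: float = 0.5) -> float:
    """Fraction of ground-truth moments recovered at the given IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# One ground-truth moment, phrased two ways: a descriptive caption
# (training-style) and an under-specified search query (real-world style).
annotations = [
    {"caption": "a man in a red jacket opens the front door and steps outside",
     "query": "man leaves house",
     "gt": (12.0, 18.5)},
]

def language_gap(retrieve) -> float:
    """retrieve(text) -> Moment is an assumed single-video model interface."""
    gts = [a["gt"] for a in annotations]
    r_cap = recall_at([retrieve(a["caption"]) for a in annotations], gts)
    r_qry = recall_at([retrieve(a["query"]) for a in annotations], gts)
    return r_cap - r_qry  # positive gap = model depends on caption detail
```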

The team identified 'decoder-query collapse' as the core technical problem: most of a DETR decoder's queries become inactive, leaving the model unable to handle queries that reference multiple moments. Their solution involves architectural modifications that increase the number of active decoder queries, yielding performance improvements of up to 14.82% mAP_m on standard search queries and 21.83% mAP_m specifically on multi-moment queries. This work represents a significant step toward practical video search systems that understand natural human queries rather than requiring perfectly descriptive captions, with implications for video editing, content moderation, and media analysis tools.
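As a rough illustration of what 'collapse' looks like in a DETR-style decoder, the sketch below builds a minimal moment-retrieval head with learnable decoder queries and counts how many of them ever fire. The module layout, head names, and 0.5 confidence threshold are assumptions for illustration; this is not the paper's architecture.

```python
import torch
import torch.nn as nn

class MomentDecoder(nn.Module):
    """Minimal DETR-style decoder head for moment retrieval (a sketch)."""
    def __init__(self, d_model: int = 256, num_queries: int = 10, num_layers: int = 2):
        super().__init__()
        # Learnable decoder queries, each meant to propose one moment.
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.span_head = nn.Linear(d_model, 2)   # (center, width) per query
        self.score_head = nn.Linear(d_model, 1)  # foreground confidence per query

    def forward(self, memory: torch.Tensor):
        # memory: (B, T, d_model) fused video-text features (assumed input).
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)
        return self.span_head(h).sigmoid(), self.score_head(h).sigmoid()

# Diagnosing collapse: if only one or two queries ever score above the
# threshold across a batch, the decoder cannot represent multi-moment answers.
model = MomentDecoder()
memory = torch.randn(4, 100, 256)  # 4 clips, 100 temporal positions
spans, scores = model(memory)
active = (scores.squeeze(-1) > 0.5).any(dim=0).sum().item()
print(f"active queries: {active} / {model.queries.num_embeddings}")
```

A healthy model spreads distinct moments across many queries; a collapsed one concentrates all confidence in one or two, which is what the paper's modifications aim to prevent.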

Key Points
  • Identified 'decoder-query collapse' as the key failure mode of DETR-based architectures on multi-moment video queries
  • Created three modified benchmarks from popular datasets that expose significant degradation on real search queries
  • Architectural modifications improve multi-moment search accuracy by up to 21.83% mAP_m over existing methods

Why It Matters

Enables practical video search with natural language queries instead of requiring perfectly descriptive captions.