Towards Visual Query Segmentation in the Wild
New AI method segments every pixel of a target object across 1.3M video frames with a single visual query.
A research team has introduced Visual Query Segmentation (VQS), a new computer vision paradigm that moves beyond simple bounding boxes. The core challenge is to find and precisely segment every pixel-level occurrence of a specific object in a long, untrimmed video, using only a single external image of the target as a query. To ground this research, the team created VQS-4K, a massive, high-quality benchmark containing 4,111 diverse videos spanning over 1.3 million frames and 222 object categories. Each video is meticulously annotated with spatio-temporal 'masklets' for a queried object, providing the first dedicated dataset for this complex task.
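The task's input/output contract can be sketched as follows. This is a minimal illustration of the VQS setup described above, not the authors' code: the function name, the grid-of-pixels frame representation, and the toy exact-match segmenter are all hypothetical, chosen only to show that a single query image must yield one binary mask per frame (a "masklet"), including all-zero masks for frames where the target is absent.

```python
def visual_query_segmentation(query_image, video_frames, segment_fn):
    """Run a per-frame segmenter over an untrimmed video.

    query_image:  the single external image of the target (toy 2D pixel grid)
    video_frames: list of frames, each an H x W pixel grid
    segment_fn:   callable (query, frame) -> H x W binary mask

    Returns a 'masklet': one binary mask per frame; a frame where the
    target does not appear simply gets an all-zero mask.
    """
    return [segment_fn(query_image, frame) for frame in video_frames]


def match_exact(query, frame):
    """Toy segmenter: mark pixels equal to the query's single pixel value."""
    target = query[0][0]
    return [[1 if px == target else 0 for px in row] for row in frame]


# Two tiny 2x2 frames: the target value (1) appears only in the first.
frames = [[[0, 1], [1, 0]],
          [[0, 0], [0, 0]]]
masklet = visual_query_segmentation([[1]], frames, match_exact)
# masklet[0] highlights the matching pixels; masklet[1] is all zeros.
```

In the real benchmark the segmenter is a learned model and frames are full-resolution video, but the contract is the same: one query in, a dense mask per frame out.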
To tackle VQS, the researchers developed VQ-SAM, a simple yet powerful method built upon Meta's Segment Anything Model 2 (SAM 2). VQ-SAM uses a multi-stage framework that progressively evolves a memory of the target by leveraging both target-specific cues and background 'distractor' information from the video itself. A key component is its Adaptive Memory Generation (AMG) module, which refines the model's understanding of the target over time. In extensive testing on the new VQS-4K benchmark, VQ-SAM achieved promising results, surpassing all existing approaches and demonstrating a viable path forward for this challenging problem. The release of the benchmark, code, and model aims to inspire a wave of new research and practical applications in precise, comprehensive video analysis.
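The idea of a memory that evolves by accumulating both target cues and background distractors can be sketched conceptually. The class below is a hypothetical toy, not the authors' AMG module: it keeps small banks of target and distractor feature vectors and scores a new candidate by how much more it resembles the target bank than the distractor bank, which is one plausible way such a memory could separate the queried object from look-alikes.

```python
from dataclasses import dataclass, field

def cosine(u, v):
    """Cosine similarity between two plain-list feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

@dataclass
class AdaptiveMemory:
    """Toy memory bank holding target and distractor feature vectors."""
    target: list = field(default_factory=list)
    distractor: list = field(default_factory=list)
    capacity: int = 8  # per-bank size limit (arbitrary for this sketch)

    def update(self, feat, is_target):
        """Add a feature to the appropriate bank, evicting the oldest entry."""
        bank = self.target if is_target else self.distractor
        bank.append(feat)
        if len(bank) > self.capacity:
            bank.pop(0)

    def score(self, feat):
        """Target-likeness: best target similarity minus best distractor similarity."""
        t = max((cosine(feat, m) for m in self.target), default=0.0)
        d = max((cosine(feat, m) for m in self.distractor), default=0.0)
        return t - d

# Seed the memory with the query's feature, then record a distractor seen
# in the video; later candidates are ranked against both banks.
mem = AdaptiveMemory()
mem.update([1.0, 0.0], is_target=True)     # feature from the visual query
mem.update([0.0, 1.0], is_target=False)    # background look-alike
```

A candidate resembling the query now scores high, while one resembling the stored distractor scores low, so the memory suppresses false positives that a query-only matcher would accept.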
- Introduces Visual Query Segmentation (VQS), a new task for finding and segmenting every pixel-level instance of a queried object in untrimmed video.
- Presents the VQS-4K benchmark: 4,111 videos, 1.3M+ frames, and 222 object categories with high-quality manual annotations.
- Proposes VQ-SAM, a method extending SAM 2 with a multi-stage framework and Adaptive Memory Generation, which outperforms existing models on the new benchmark.
Why It Matters
Enables precise, frame-by-frame object tracking for applications in video editing, surveillance, autonomous systems, and media analysis, moving beyond crude bounding boxes.