Towards Visual Query Segmentation in the Wild
New AI method segments every pixel of a target object across 1.3M video frames with a single visual query.
A research team has introduced Visual Query Segmentation (VQS), a new computer vision paradigm that moves beyond simple bounding boxes. The core challenge is to find and precisely segment every pixel-level occurrence of a specific object in a long, untrimmed video, using only a single external image of the target as a query. To ground this research, the team created VQS-4K, a massive, high-quality benchmark containing 4,111 diverse videos spanning over 1.3 million frames and 222 object categories. Each video is meticulously annotated with spatio-temporal 'masklets' for a queried object, providing the first dedicated dataset for this complex task.
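The task's input/output contract can be sketched as follows. This is a minimal illustration of the VQS setup described above, not the authors' code: the function name, the grid-of-pixels frame representation, and the toy exact-match segmenter are all hypothetical, chosen only to show that a single query image must yield one binary mask per frame (a "masklet"), including all-zero masks for frames where the target is absent.

```python
def visual_query_segmentation(query_image, video_frames, segment_fn):
    """Run a per-frame segmenter over an untrimmed video.

    query_image:  the single external image of the target (toy 2D pixel grid)
    video_frames: list of frames, each an H x W pixel grid
    segment_fn:   callable (query, frame) -> H x W binary mask

    Returns a 'masklet': one binary mask per frame; a frame where the
    target does not appear simply gets an all-zero mask.
    """
    return [segment_fn(query_image, frame) for frame in video_frames]


def match_exact(query, frame):
    """Toy segmenter: mark pixels equal to the query's single pixel value."""
    target = query[0][0]
    return [[1 if px == target else 0 for px in row] for row in frame]


# Two tiny 2x2 frames: the target value (1) appears only in the first.
frames = [[[0, 1], [1, 0]],
          [[0, 0], [0, 0]]]
masklet = visual_query_segmentation([[1]], frames, match_exact)
# masklet[0] highlights the matching pixels; masklet[1] is all zeros.
```

In the real benchmark the segmenter is a learned model and frames are full-resolution video, but the contract is the same: one query in, a dense mask per frame out.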
To tackle VQS, the researchers developed VQ-SAM, a simple yet powerful method built upon Meta's Segment Anything Model 2 (SAM 2). VQ-SAM uses a multi-stage framework that progressively evolves a memory of the target by leveraging both target-specific cues and background 'distractor' information from the video itself. A key component is its Adaptive Memory Generation (AMG) module, which refines the model's understanding of the target over time. In extensive testing on the new VQS-4K benchmark, VQ-SAM achieved promising results, surpassing all existing approaches and demonstrating a viable path forward for this challenging problem. The release of the benchmark, code, and model aims to inspire a wave of new research and practical applications in precise, comprehensive video analysis.
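The idea of a memory that evolves by accumulating both target cues and background distractors can be sketched conceptually. The class below is a hypothetical toy, not the authors' AMG module: it keeps small banks of target and distractor feature vectors and scores a new candidate by how much more it resembles the target bank than the distractor bank, which is one plausible way such a memory could separate the queried object from look-alikes.

```python
from dataclasses import dataclass, field

def cosine(u, v):
    """Cosine similarity between two plain-list feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

@dataclass
class AdaptiveMemory:
    """Toy memory bank holding target and distractor feature vectors."""
    target: list = field(default_factory=list)
    distractor: list = field(default_factory=list)
    capacity: int = 8  # per-bank size limit (arbitrary for this sketch)

    def update(self, feat, is_target):
        """Add a feature to the appropriate bank, evicting the oldest entry."""
        bank = self.target if is_target else self.distractor
        bank.append(feat)
        if len(bank) > self.capacity:
            bank.pop(0)

    def score(self, feat):
        """Target-likeness: best target similarity minus best distractor similarity."""
        t = max((cosine(feat, m) for m in self.target), default=0.0)
        d = max((cosine(feat, m) for m in self.distractor), default=0.0)
        return t - d

# Seed the memory with the query's feature, then record a distractor seen
# in the video; later candidates are ranked against both banks.
mem = AdaptiveMemory()
mem.update([1.0, 0.0], is_target=True)     # feature from the visual query
mem.update([0.0, 1.0], is_target=False)    # background look-alike
```

A candidate resembling the query now scores high, while one resembling the stored distractor scores low, so the memory suppresses false positives that a query-only matcher would accept.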
- Introduces Visual Query Segmentation (VQS), a new task for finding and segmenting every pixel-level instance of a queried object in untrimmed video.
- Presents the VQS-4K benchmark: 4,111 videos, 1.3M+ frames, and 222 object categories with high-quality manual annotations.
- Proposes VQ-SAM, a method extending SAM 2 with a multi-stage framework and Adaptive Memory Generation, which outperforms existing models on the new benchmark.
Why It Matters
Enables precise, frame-by-frame object tracking for applications in video editing, surveillance, autonomous systems, and media analysis, moving beyond crude bounding boxes.