SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
New video AI from UAE researchers solves object tracking drift, improving benchmark scores by up to 8.9 points.
A research team from UAE University and Khalifa University has introduced SPARROW, a novel architecture designed to solve a critical flaw in video-based Multimodal Large Language Models (MLLMs). Existing models struggle with 'spatial drift' and 'identity switches': they lose track of objects as those objects move, become occluded, or reappear across video frames. SPARROW tackles this by injecting temporally aligned referent cues during training and using a dual-prompt system that decodes both bounding boxes ([BOX]) and segmentation masks ([SEG]), fusing geometric precision with semantic understanding.
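To make the dual-prompt design concrete, the sketch below shows one plausible reading of it: the LLM's hidden states at the special [BOX] and [SEG] token positions are routed to two separate heads, one regressing per-frame box coordinates and one producing a query embedding for a promptable mask decoder. This is a minimal illustration under stated assumptions, not SPARROW's released code; the class DualPromptHeads, its dimensions, and the head designs are all hypothetical.

```python
import torch
import torch.nn as nn

class DualPromptHeads(nn.Module):
    """Hypothetical dual-prompt heads: one head per special token."""

    def __init__(self, hidden_dim: int = 4096, mask_dim: int = 256):
        super().__init__()
        # [BOX] head: regress a normalized (cx, cy, w, h) box per frame.
        self.box_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, 4),
            nn.Sigmoid(),  # keep coordinates in [0, 1]
        )
        # [SEG] head: project to a query embedding that a promptable mask
        # decoder (e.g., SAM2-style) would consume to produce the mask.
        self.seg_head = nn.Linear(hidden_dim, mask_dim)

    def forward(self, box_hidden: torch.Tensor, seg_hidden: torch.Tensor):
        # box_hidden / seg_hidden: (num_frames, hidden_dim) LLM hidden
        # states taken at the [BOX] / [SEG] token positions per frame.
        boxes = self.box_head(box_hidden)           # (num_frames, 4)
        mask_queries = self.seg_head(seg_hidden)    # (num_frames, mask_dim)
        return boxes, mask_queries

heads = DualPromptHeads()
frames = 8
boxes, queries = heads(torch.randn(frames, 4096), torch.randn(frames, 4096))
print(boxes.shape, queries.shape)  # torch.Size([8, 4]) torch.Size([8, 256])
```

Fusing the two outputs, the paper's "geometric precision with semantic understanding," would then amount to supervising both heads jointly so the box and mask agree on the same referent in every frame.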
SPARROW was trained on a curated dataset of 30,646 videos with 45,231 Q&A pairs and operates end-to-end without external object detectors, relying instead on a class-agnostic SAM2-based proposer. When integrated into three leading open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), it delivered consistent performance gains: up to +8.9 points on the J&F metric for referring video object segmentation (RVOS), +5 mIoU on visual grounding, and +5.4 points on the CLAIR metric of the grounded conversation generation (GCG) benchmark.
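The consistency problem itself is easy to illustrate with a toy tracker. The sketch below greedily links class-agnostic per-frame box proposals to an existing track by IoU, reporting the referent missing in occluded frames rather than switching identity. It is only a didactic stand-in: SPARROW matches with learned, temporally aligned features rather than raw box overlap, and both helper functions here are hypothetical.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_track(track_box, proposals, min_iou=0.3):
    """Continue a track with the best-overlapping proposal, or report the
    referent missing (occluded) rather than switching to a wrong object."""
    best = max(proposals, key=lambda p: iou(track_box, p), default=None)
    if best is None or iou(track_box, best) < min_iou:
        return None  # referent occluded/absent this frame; retry later
    return best

track = np.array([10.0, 10.0, 50.0, 50.0])
proposals = [np.array([12.0, 11.0, 52.0, 49.0]),      # same object, shifted
             np.array([200.0, 200.0, 240.0, 240.0])]  # distractor
print(link_track(track, proposals))  # -> [12. 11. 52. 49.]
```

Refusing a low-IoU match is what keeps even this toy tracker out of the 'identity switch' failure mode the paper describes.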
The work, accepted for presentation at CVPR 2026, demonstrates that SPARROW's approach to maintaining 'temporal referential consistency' significantly advances the state of the art. By ensuring objects are tracked accurately and consistently across time, the model enables more reliable, detailed, and actionable understanding of dynamic visual scenes, moving beyond static image analysis to robust video comprehension.
- Solves 'spatial drift' in video AI with Target-Specific Tracked Features (TSF) and a dual-prompt ([BOX]/[SEG]) design.
- Trained on 30,646 curated videos, it improved benchmark scores by up to +8.9 J&F on RVOS without external detectors.
- Integrated into three open-source models (UniPixel, GLUS, VideoGLaMM), showing that the architecture works as a versatile performance booster.
Why It Matters
Enables AI to track objects in video with human-like consistency, critical for autonomous systems, content analysis, and advanced video editing.