AVTrack dataset challenges AI audio-visual tracking in complex real-world scenes

arXiv cs.CV June 03, 2026

⚡New benchmark exposes weaknesses in audio-visual tracking under camera motion and occlusions.

Deep Dive

Audio-visual speaker tracking aims to localize and track active speakers by combining auditory and visual cues, crucial for applications like intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are mostly limited to simple or homogeneous scenes with coarse annotations, biasing evaluation toward static co-occurrence rather than robust spatiotemporal modeling. To address this, researchers from multiple institutions present AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world conditions. AVTrack features diverse challenges including camera motion, visual occlusions, and position changes, making it a more realistic benchmark for robust cross-modal reasoning.

Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, indicating that current models struggle under complex, dynamic conditions. This positions AVTrack as a challenging new benchmark for human-centric audio-visual scene understanding. The research team also provides a simple yet effective baseline to help advance the field. The paper runs 19 pages with 10 figures and has been accepted at ICML 2026. A project website and open-source code are available to support further development of practical audio-visual tracking systems.

Key Points

AVTrack introduces a human-centric AVIS dataset with camera motion, occlusions, and position changes.
Current methods show significant performance drops on AVTrack, highlighting the need for better spatiotemporal modeling.
Accepted at ICML 2026, the paper provides an open-source baseline and project website for community use.

Why It Matters

Gives researchers a more realistic benchmark for building audio-visual tracking that holds up in real-world applications such as surveillance, video editing, and human-computer interaction.
