Sparse3DTrack: Monocular 3D Object Tracking Using Sparse Supervision
New AI framework achieves gains of up to 15.5 percentage points using at most four labeled frames per object track.
A research team has introduced Sparse3DTrack, the first framework for monocular 3D object tracking that operates under sparse supervision. Monocular 3D tracking is crucial for autonomous systems to understand scene dynamics from video, but existing state-of-the-art methods rely on dense, expensive 3D annotations across long sequences. Sparse3DTrack addresses this scalability bottleneck by requiring only a handful of labeled frames, at most four ground-truth annotations per object track, instead of frame-by-frame labeling.
The framework cleverly decomposes the complex task into two sequential, learnable sub-problems: 2D query matching and 3D geometry estimation. Both components exploit the inherent spatio-temporal consistency of video sequences to amplify a sparse set of human-labeled samples. By learning rich 2D and 3D representations, the model can automatically generate high-quality 3D pseudolabels across entire videos, effectively transforming a few annotations into a dense, usable training signal.
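To make the idea concrete, here is a minimal, hypothetical sketch of how a few keyframe labels could be densified into per-frame pseudolabels. The names (`Label3D`, `densify_track`) and the linear-interpolation stand-in are illustrative assumptions, not the paper's actual components; Sparse3DTrack replaces the interpolation with its learned 2D query matching and 3D geometry estimation stages.

```python
# Hypothetical sketch, not the paper's API: expand <=4 sparse 3D labels
# per track into dense per-frame pseudolabels. The learned stages are
# replaced by naive linear interpolation to keep the example runnable.
from dataclasses import dataclass


@dataclass
class Label3D:
    frame: int                       # video frame index
    xyz: tuple[float, float, float]  # 3D box center in camera coordinates
    yaw: float                       # heading angle in radians


def lerp(a: float, b: float, t: float) -> float:
    """Linearly interpolate between a and b for t in [0, 1]."""
    return a + (b - a) * t


def densify_track(keyframes: list[Label3D]) -> list[Label3D]:
    """Expand a handful of labeled frames into per-frame pseudolabels.

    Stand-in for the learned pipeline: 2D query matching would associate
    the object across frames and 3D geometry estimation would regress each
    frame's box; here both are replaced by interpolation between keyframes.
    """
    keyframes = sorted(keyframes, key=lambda k: k.frame)
    dense: list[Label3D] = []
    for a, b in zip(keyframes, keyframes[1:]):
        for f in range(a.frame, b.frame):
            t = (f - a.frame) / (b.frame - a.frame)
            dense.append(Label3D(
                frame=f,
                xyz=tuple(lerp(pa, pb, t) for pa, pb in zip(a.xyz, b.xyz)),
                yaw=lerp(a.yaw, b.yaw, t),
            ))
    dense.append(keyframes[-1])  # keep the final ground-truth keyframe
    return dense


# Four ground-truth keyframes for one object track...
keyframes = [
    Label3D(frame=0,  xyz=(2.0, 1.5, 20.0), yaw=0.00),
    Label3D(frame=30, xyz=(2.4, 1.5, 14.0), yaw=0.05),
    Label3D(frame=60, xyz=(2.9, 1.5, 8.0),  yaw=0.10),
    Label3D(frame=90, xyz=(3.3, 1.5, 3.0),  yaw=0.12),
]
# ...become dense pseudolabels for every frame from 0 through 90.
print(len(densify_track(keyframes)))  # 91
```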
These generated pseudolabels enable existing, fully supervised 3D trackers to perform effectively under extreme label scarcity. In extensive experiments on major autonomous driving datasets, the method demonstrated significant improvements, boosting tracking performance by up to 15.5 percentage points on standard benchmarks. The approach represents a major step toward practical and scalable perception for robotics and autonomous vehicles, where collecting dense 3D ground truth is prohibitively costly.
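Continuing the sketch above, the dense pseudolabels can then be fed to an off-the-shelf tracker as if they were ordinary ground truth. The merging helper below is again a hypothetical illustration: on the few frames that carry human annotations, the real labels take precedence, and pseudolabels fill in everywhere else.

```python
# Hypothetical helper, reusing the Label3D type from the previous sketch.
def build_training_labels(pseudolabels: list["Label3D"],
                          human_labels: dict[int, "Label3D"]) -> list["Label3D"]:
    """Assemble a dense training signal: prefer the (at most four) human
    annotations where they exist, fall back to pseudolabels elsewhere."""
    return [human_labels.get(p.frame, p) for p in pseudolabels]

# The result is a frame-by-frame label list that any existing fully
# supervised 3D tracker can train on without modification.
```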
- Reduces annotation needs by ~99%, requiring at most four 3D labels per object track versus dense frame-by-frame labeling.
- Improves tracking performance by up to 15.5 percentage points on KITTI and nuScenes benchmarks.
- Enables existing fully supervised trackers to work with sparse data by generating high-quality 3D pseudolabels across videos.
Why It Matters
Drastically lowers the cost and time of developing perception systems for autonomous vehicles and robots by minimizing 3D data labeling.