Research & Papers

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

New vision model handles dynamic scenes and partial occlusions by predicting tracking models from corrupted frames.

Deep Dive

A research team from Academia Sinica and National Yang Ming Chiao Tung University has introduced GOT-JEPA, a novel framework for generic object tracking that tackles two persistent challenges: poor generalization to unseen scenarios and coarse occlusion handling. The system builds on Meta's Joint-Embedding Predictive Architecture (JEPA), originally designed for self-supervised learning from images, but extends it to predict entire tracking models. In this setup, a teacher predictor generates pseudo-tracking models from clean video frames, while a student predictor learns to replicate these models from corrupted versions of the same frames. This approach provides a stable supervision signal and explicitly trains the tracker to stay reliable under adverse conditions like occlusions and visual distractors.
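The teacher-student scheme can be sketched roughly as follows. This is a minimal toy illustration of the supervision structure only, not the paper's architecture: the linear predictors, embedding sizes, feature-masking corruption, learning rate, and EMA teacher update are all assumptions invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(frame, mask_ratio=0.5):
    """Simulate occlusions/distractors by zeroing a random subset of features."""
    out = frame.copy()
    idx = rng.choice(frame.size, int(frame.size * mask_ratio), replace=False)
    out[idx] = 0.0
    return out

class Predictor:
    """Toy linear map from a frame embedding to 'tracking model' weights."""
    def __init__(self, dim, out_dim):
        self.W = rng.normal(scale=0.1, size=(out_dim, dim))

    def __call__(self, x):
        return self.W @ x

dim, out_dim, lr, ema = 64, 16, 0.05, 0.99
teacher, student = Predictor(dim, out_dim), Predictor(dim, out_dim)

def train_step(frame):
    """One JEPA-style step: the student must reproduce the teacher's
    pseudo tracking model while seeing only the corrupted frame."""
    target = teacher(frame)                 # pseudo model from the clean frame (no gradient)
    noisy = corrupt(frame)                  # corrupted view of the same frame
    err = student(noisy) - target           # regression error in model space
    student.W -= lr * np.outer(err, noisy)  # gradient step on 0.5 * ||err||^2
    teacher.W = ema * teacher.W + (1 - ema) * student.W  # slow-moving EMA teacher
    return float(np.mean(err ** 2))

losses = [train_step(rng.normal(size=dim)) for _ in range(200)]
```

Because the teacher only ever sees clean frames, its outputs are a stable regression target, while the student is forced to produce the same tracking model from degraded input, which is the source of the robustness the paper describes.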

The researchers further enhanced the system with OccuSolver, a dedicated module for fine-grained occlusion perception. OccuSolver adapts a point-centric tracker to perform object-aware visibility estimation, capturing detailed occlusion patterns. It works iteratively, using object priors generated by the main tracker to refine visibility states and produce higher-quality reference labels, which in turn improve subsequent model predictions. This closed-loop refinement allows the system to reason about occlusions at a much finer granularity than previous methods. Extensive testing on seven established tracking benchmarks demonstrated that GOT-JEPA significantly improves both generalization and robustness in dynamic, real-world environments where objects frequently disappear and reappear.
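As a rough illustration of the closed-loop idea, and not the paper's actual algorithm, the sketch below blends noisy per-point visibility scores with an object-level prior, thresholds them into visibility labels, and feeds a locally averaged version of those labels back as the next prior. The scoring model, smoothing weight, and prior update are all invented for the example.

```python
import numpy as np

# Toy 1-D "object" of 10 tracked points; ground truth: points 3-6 are occluded.
true_visible = np.ones(10)
true_visible[3:7] = 0.0

# Noisy per-point visibility scores, standing in for a point-centric tracker.
rng = np.random.default_rng(1)
scores = true_visible * 0.8 + 0.1 + rng.normal(scale=0.05, size=10)

def refine(scores, object_prior, n_iters=5, smooth=0.5):
    """Closed-loop refinement: blend point scores with an object-level prior,
    threshold into visibility labels, then recompute the prior from the labels."""
    vis = scores.copy()
    for _ in range(n_iters):
        vis = (1 - smooth) * vis + smooth * object_prior  # object-aware smoothing
        labels = (vis > 0.5).astype(float)                # refined visibility labels
        # Stand-in prior update: local averaging of the current labels.
        object_prior = np.convolve(labels, np.ones(3) / 3, mode="same")
    return labels

labels = refine(scores, np.full(10, 0.5))  # start from an uninformative prior
```

The point of the loop is that per-point evidence and object-level context correct each other: isolated noisy scores get pulled toward the object prior, and the cleaned-up labels then sharpen the prior for the next pass, mirroring how OccuSolver's refined visibility states feed back into subsequent model predictions.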

Key Points
  • Extends Meta's JEPA architecture to predict tracking models, not just image features, improving adaptation to new scenes.
  • Uses a teacher-student training paradigm with corrupted frames to explicitly learn robustness to occlusions and distractors.
  • Integrates the OccuSolver module for iterative, fine-grained occlusion perception, outperforming existing trackers on seven benchmarks.

Why It Matters

Enables more reliable AI for autonomous vehicles, surveillance, and robotics where objects are frequently obscured.