NYCU PhD thesis proposes human-level object tracking with enhanced reasoning
How to make AI track objects like humans despite deformation and unseen categories
Shih-Fang Chen's 2026 PhD dissertation from National Yang Ming Chiao Tung University rethinks generic object tracking (GOT), the task of localizing an arbitrary object in video after a single bounding-box initialization. Current trackers struggle when targets undergo severe deformation, encounter complex distractors, face significant environmental changes, or belong to categories never seen during training. The dissertation identifies generalization and online adaptation as the key bottlenecks that keep machine tracking from matching human visual perception, which integrates prior knowledge, spatial geometry, and semantic context to maintain robust continuity.
Chen proposes a systematic framework that enhances three core capabilities: target discrimination (distinguishing the object from similar-looking distractors), robust adaptation (adjusting to appearance and environmental changes on the fly), and geometric reasoning (understanding spatial relationships and object layout). The abstract provides no benchmark numbers; the work is positioned as a step toward human-level perceptual intelligence in video understanding. The full PDF (over 35 MB) is available on arXiv, where substantial overlap with a related prior paper (arXiv:2602.14771) is noted.
- Dissertation tackles four failure modes: severe deformation, complex distractors, environmental changes, and unseen object categories.
- Proposes three capability enhancements: target discrimination, robust adaptation, and geometric reasoning.
- Aims to close the gap between machine trackers and human visual perception, which draws on prior knowledge, spatial geometry, and semantic context.
Why It Matters
This research could unlock robust video tracking for autonomous vehicles, surveillance, and robotics in unpredictable real-world conditions.