Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned
Real-world tests show top AI navigation models fail in repetitive spaces and crash frequently.
A team from Polytechnique Montréal and Mila conducted the first comprehensive real-world evaluation of five leading Visual Navigation Models (VNMs): GNM, ViNT, NoMaD, NaviBridger, and CrossFormer. Moving beyond simple success rates, the study assessed these models across two robot platforms and five indoor/outdoor environments using path-based metrics and vision-based goal recognition. The zero-shot tests uncovered critical, systematic failures that are masked in simulated benchmarks, revealing a significant gap between lab performance and real-world utility.
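The summary does not spell out which path-based metrics were used; a minimal sketch of one common choice, a mean deviation between the executed trajectory and the reference path (all names here are hypothetical, not the authors' code):

```python
import numpy as np

def average_path_error(executed, reference):
    """Mean Euclidean deviation between an executed trajectory and a
    reference path: for each executed pose, take the distance to the
    nearest reference waypoint, then average. A hypothetical stand-in
    for the study's path-based metrics, which are not detailed here."""
    executed = np.asarray(executed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Pairwise distances between executed poses (N) and reference waypoints (M).
    diffs = executed[:, None, :] - reference[None, :, :]  # shape (N, M, 2)
    dists = np.linalg.norm(diffs, axis=-1)                # shape (N, M)
    # Nearest-waypoint distance per executed pose, averaged over the run.
    return float(dists.min(axis=1).mean())

# A straight reference path vs. a run that drifts slightly off-axis.
ref = [[0, 0], [1, 0], [2, 0], [3, 0]]
run = [[0, 0.1], [1, 0.2], [2, 0.1], [3, 0.0]]
print(average_path_error(run, ref))  # mean drift of about 0.1 m
```

Unlike a binary success rate, a metric like this registers partial failures such as sustained drift, which is why path-based evaluation can surface problems that success rates mask.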
The analysis pinpointed three core limitations. First, even sophisticated diffusion- and transformer-based architectures exhibited frequent collisions, indicating a lack of true geometric understanding. Second, models struggled in repetitive environments like office hallways, failing to distinguish between perceptually similar locations with subtle semantic differences. Third, performance degraded sharply under controlled distribution shifts, such as the introduction of motion blur or sun flare into the visual input. The researchers argue that current evaluation practices are insufficient and will publicly release their codebase and dataset to enable more reproducible and rigorous benchmarking of future navigation AI.
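The exact perturbation pipeline is not described in this summary, but a motion-blur shift of the kind tested can be reproduced with a simple horizontal box kernel; a sketch under that assumption (the function name is illustrative, not from the paper):

```python
import numpy as np

def motion_blur(image, kernel_size=5):
    """Horizontal motion blur on a 2-D grayscale image: each pixel becomes
    the average of itself and its horizontal neighbors. A simple stand-in
    for the controlled distribution shifts described in the study."""
    img = np.asarray(image, dtype=float)
    pad = kernel_size // 2
    # Edge-replicate padding so border pixels average over valid values.
    padded = np.pad(img, ((0, 0), (pad, pad)), mode="edge")
    out = np.zeros_like(img)
    for k in range(kernel_size):
        out += padded[:, k:k + img.shape[1]]  # shifted copies of the row
    return out / kernel_size

# A sharp vertical edge smears into a gradient: exactly the kind of
# feature degradation that can break a vision-based navigation policy.
edge = np.zeros((4, 16))
edge[:, 8:] = 1.0
blurred = motion_blur(edge, kernel_size=5)
```

Running a model on both `edge`-like clean frames and their blurred counterparts, and comparing the resulting trajectories, is one way to quantify the robustness gap the study reports.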
- Five top VNMs (GNM, ViNT, NoMaD, NaviBridger, CrossFormer) were tested in real-world zero-shot scenarios across five environments.
- Key failures included frequent collisions (poor geometry understanding) and errors in repetitive spaces due to similar visual features.
- Performance degraded under image perturbations like motion blur, highlighting robustness issues not caught by standard success-rate metrics.
Why It Matters
These failures expose a critical reliability gap for deploying AI navigation on real-world robots, from warehouse fleets to autonomous vehicles.