Robotics

Do Open-Loop Metrics Predict Closed-Loop Driving? A Cross-Benchmark Correlation Study of NAVSIM and Bench2Drive

A cross-benchmark study of 8 methods finds that open-loop scores correlate strongly with closed-loop results but do not reliably preserve method rankings.

Deep Dive

A new paper on arXiv (2605.00066) by Yiru Wang and colleagues investigates whether open-loop metrics, which are fast and reproducible but not truly interactive, can predict closed-loop driving performance for autonomous driving planners. The authors cross-referenced published results from 15 state-of-the-art methods across two benchmarks: NAVSIM v2 (open-loop) and Bench2Drive (closed-loop). They compiled paired data for 8 methods and found that NAVSIM's aggregate PDM Score correlates strongly but non-monotonically with Bench2Drive's Driving Score. Non-monotonicity means ranking inversions occur: a planner that ranks highly in open-loop evaluation may rank lower when actually driving in closed-loop simulation.
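
The "strong but non-monotonic" finding is easiest to see with Spearman rank correlation, the statistic the paper reports. A minimal sketch with hypothetical paired scores for 8 planners (illustrative numbers, not the paper's data) shows how ρ can be high even while the top of the leaderboard flips:

```python
def ranks(xs):
    # Rank values (1 = lowest); ties not handled, fine for distinct scores
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(a, b):
    # Spearman rho for distinct values: 1 - 6*sum(d^2) / (n*(n^2 - 1))
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for 8 planners (NOT the paper's data):
pdms = [88, 86, 85, 84, 83, 80, 78, 75]  # open-loop PDM Scores
ds   = [62, 65, 55, 58, 50, 48, 45, 40]  # closed-loop Driving Scores

print(round(spearman_rho(pdms, ds), 2))  # -> 0.95
```

Here ρ = 0.95, yet the planner ranked first on PDM Score is only second on Driving Score: a ranking inversion of exactly the kind the study reports.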

Among individual sub-metrics, Ego Progress (EP) proved the strongest single predictor of closed-loop success, significantly outperforming the safety-critical no-collision metric (NC). The study also reveals a safety-progress trade-off: methods that prioritize safety at the expense of progress rank highly in NAVSIM but underperform in closed-loop driving due to timeout and slow-driving penalties. Notably, a simpler 3-metric formula (excluding time-to-collision and comfort, which are near saturation) matches the predictive power of the full 5-metric PDMS at Spearman ρ=0.90. For the remaining gap, the authors propose a "snowball effect" as a key mechanism: small open-loop deviations compound into closed-loop failures.
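
The saturation argument can be sketched numerically. The weights below follow the original NAVSIM PDM Score formulation (multiplicative penalties times a weighted sub-metric average); the 3-metric variant is an illustrative reconstruction of the idea, not necessarily the paper's exact formula:

```python
def pdms_full(nc, dac, ep, ttc, comfort):
    # NAVSIM-style PDM Score: hard penalties (NC, DAC) multiply a
    # weighted average of Ego Progress, TTC, and Comfort
    return nc * dac * (5 * ep + 5 * ttc + 2 * comfort) / 12

def pdms_3metric(nc, dac, ep):
    # Simplified variant: drop TTC and Comfort, which the study found
    # near saturation (nearly all methods score ~1.0 on them)
    return nc * dac * ep

# With TTC and Comfort pinned at 1.0, only NC, DAC, and EP vary,
# so both formulas rank methods identically:
for nc, dac, ep in [(1.0, 0.95, 0.80), (1.0, 0.98, 0.70), (0.97, 0.99, 0.85)]:
    print(pdms_full(nc, dac, ep, 1.0, 1.0), pdms_3metric(nc, dac, ep))
```

When a sub-metric barely varies across methods, it adds almost no ranking information, which is why dropping TTC and Comfort leaves the Spearman correlation with Driving Score unchanged.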

Key Points
  • NAVSIM's PDM Score shows strong but non-monotonic correlation with Bench2Drive Driving Score, leading to ranking inversions among 8 methods.
  • Ego Progress (EP) is the strongest single predictor of closed-loop success, outperforming collision metric NC.
  • A simpler 3-metric formula (excluding TTC and Comfort) achieves the same predictive power (Spearman ρ=0.90) as the full 5-metric PDMS.

Why It Matters

This research challenges autonomous driving evaluation standards, potentially reshaping how planners are validated before real-world deployment.