Robotics

Bench2Drive-VL: Benchmarks for Closed-Loop Autonomous Driving with Vision-Language Models

New benchmark tests VLMs in realistic driving scenarios with severe deviations and cumulative errors.

Deep Dive

A research team from Shanghai Jiao Tong University and the Shanghai AI Laboratory has released Bench2Drive-VL, a groundbreaking benchmark that brings closed-loop evaluation to vision-language models (VLMs) for autonomous driving. Unlike existing open-loop benchmarks that test VLMs with static question-answer datasets, Bench2Drive-VL evaluates models in dynamic, realistic driving simulations where errors accumulate over time. The system is built on the CARLA simulator and introduces DriveCommenter, an automated tool that generates diverse, behavior-grounded question-answer pairs for all driving situations—including severe off-route and off-road scenarios that were previously unassessable.
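To make the idea concrete, here is a minimal sketch of how behavior-grounded QA generation from simulator state might look. The class and function names are illustrative assumptions, not the actual DriveCommenter API; the point is that answers are derived from ground-truth state, so they remain correct even in severe off-route situations that no human-collected dataset covers.

```python
from dataclasses import dataclass

@dataclass
class EgoState:
    """Minimal snapshot of the ego vehicle, as a CARLA-like simulator might expose it."""
    speed_mps: float            # current speed in meters per second
    lane_offset_m: float        # signed lateral distance from lane center
    distance_to_route_m: float  # distance from the planned route

def generate_qa(state: EgoState) -> list[tuple[str, str]]:
    """Toy stand-in for DriveCommenter-style annotation: each QA pair is
    grounded in simulator state rather than human labels."""
    qa = []
    off_route = state.distance_to_route_m > 5.0
    qa.append((
        "Is the vehicle still on its planned route?",
        "No, it has deviated and should re-plan." if off_route else "Yes.",
    ))
    if abs(state.lane_offset_m) > 1.5:
        qa.append((
            "What corrective action is needed?",
            "Steer back toward the lane center before continuing.",
        ))
    return qa
```

Because the labels come from the simulator rather than from recorded human drives, the generator can annotate arbitrarily rare states the moment the simulation produces them.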

The benchmark provides a unified protocol that allows modern VLMs like GPT-4V, LLaVA, and others to be directly plugged into the closed-loop environment for comparison with traditional autonomous driving agents. It features a flexible reasoning and control framework supporting multi-format visual inputs (including first-person and third-person views) and configurable graph-based chain-of-thought execution. This enables researchers to test how well VLMs handle out-of-distribution inputs and cumulative errors—critical factors for real-world deployment. All code and annotated datasets are open-sourced, creating a complete development ecosystem for the VLM4AD research community.
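The closed-loop protocol can be sketched as follows. The interface names below are assumptions for illustration, not the Bench2Drive-VL API; what matters is the feedback loop, in which the model's own actions determine the next observation, so early mistakes compound into out-of-distribution states rather than being scored in isolation.

```python
from typing import Callable, Protocol

class Simulator(Protocol):
    """Minimal closed-loop interface (illustrative, not the actual benchmark API)."""
    def observe(self) -> dict: ...           # camera frames + ego state
    def apply(self, control: dict) -> None: ...  # steering / throttle / brake
    def done(self) -> bool: ...

def closed_loop_eval(sim: Simulator,
                     vlm_policy: Callable[[dict], dict],
                     max_steps: int = 1000) -> int:
    """Run a VLM policy in the loop and return the number of steps taken.

    Unlike open-loop QA scoring, each control command feeds back into the
    world, so errors accumulate over the episode."""
    steps = 0
    while not sim.done() and steps < max_steps:
        obs = sim.observe()          # multi-format visual input
        control = vlm_policy(obs)    # reasoning -> driving command
        sim.apply(control)           # action changes the next observation
        steps += 1
    return steps
```

Any model that maps observations to controls, whether a VLM wrapper or a traditional planning agent, can be dropped into the same loop, which is what makes direct comparison possible.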

Bench2Drive-VL addresses a significant gap in autonomous driving evaluation by moving beyond static scene understanding tests. Traditional open-loop benchmarks fail to assess how VLMs perform when they encounter rare or unexpected situations not present in human-collected training data. By simulating these edge cases in a closed-loop environment, researchers can now measure how vision-language models adapt their reasoning and decision-making over time, providing a more reliable indicator of their potential for real-world driving applications.

Key Points
  • Introduces DriveCommenter for automated generation of diverse QA pairs in CARLA simulation, including severe off-route scenarios
  • Provides unified interface allowing direct comparison of VLMs (like GPT-4V) with traditional autonomous driving agents
  • Features flexible framework supporting multi-format visual inputs and configurable graph-based chain-of-thought execution
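The graph-based chain-of-thought execution mentioned above can be pictured as a small DAG of reasoning steps, each consuming its parents' outputs. This is a hypothetical sketch of the general pattern, not the benchmark's actual configuration format:

```python
def run_cot_graph(nodes: dict, edges: dict, inputs: dict) -> dict:
    """Execute reasoning nodes in dependency order.

    nodes: name -> function taking a dict of parent outputs
    edges: name -> list of parent names (defines the DAG)
    inputs: pre-computed values available to every node
    """
    results = dict(inputs)
    pending = [n for n in nodes if n not in results]
    while pending:
        for name in list(pending):
            parents = edges.get(name, [])
            if all(p in results for p in parents):
                results[name] = nodes[name]({p: results[p] for p in parents})
                pending.remove(name)
    return results

# Example graph: a perception step feeding a decision step.
cot = {
    "perceive": lambda p: "pedestrian ahead",
    "decide": lambda p: "brake" if "pedestrian" in p["perceive"] else "continue",
}
deps = {"perceive": [], "decide": ["perceive"]}
```

Making the graph configurable lets researchers test, for example, whether adding an explicit risk-assessment node between perception and control improves closed-loop behavior.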

Why It Matters

Enables realistic testing of AI driving systems on edge cases and cumulative errors, accelerating development of safer autonomous vehicles.