ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
A discrete diffusion planner that rewrites its own trajectory tokens on the fly.
ReflectDrive-2, from Huimin Wang and colleagues, is an autonomous driving planner that operates on discrete trajectory tokens with a masked discrete diffusion model. Unlike approaches that bolt on a separate refinement network for error correction, its AutoEdit mechanism lets the same model rewrite selected tokens directly in the discrete token space. This in-place revision capability is trained in two stages: first, structure-aware perturbations along the longitudinal and lateral directions are applied to expert trajectories, and the model learns to recover the originals; second, the full decision-draft-reflect pipeline is fine-tuned with reinforcement learning (RL), which assigns a terminal driving reward to the final post-edit trajectory.
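To make the in-place revision idea concrete, here is a minimal sketch of an AutoEdit-style loop: remask the least confident tokens of a draft and let the same model refill them. Everything below (`toy_model`, `MASK`, `VOCAB`, the confidence threshold) is an illustrative stand-in, not the authors' implementation.

```python
import numpy as np

MASK = -1          # sentinel id for a masked trajectory token
VOCAB = 32         # size of the toy trajectory-token vocabulary
rng = np.random.default_rng(0)

def toy_model(tokens):
    """Stand-in for the masked diffusion planner: per-position logits over VOCAB."""
    logits = rng.normal(size=(len(tokens), VOCAB))
    for i, t in enumerate(tokens):
        if t != MASK:
            logits[i, t] += 5.0   # bias toward keeping already-committed tokens
    return logits

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def auto_edit(tokens, threshold=0.5, max_rounds=3):
    """Remask low-confidence tokens and let the same model refill them in place."""
    tokens = list(tokens)
    for _ in range(max_rounds):
        probs = softmax(toy_model(tokens))
        conf = [probs[i, t] if t != MASK else 0.0 for i, t in enumerate(tokens)]
        low = [i for i, c in enumerate(conf) if c < threshold]
        if not low:
            break                                   # all tokens confident: stop
        for i in low:
            tokens[i] = MASK                        # erase the suspect tokens...
        probs = softmax(toy_model(tokens))
        for i in low:
            tokens[i] = int(probs[i].argmax())      # ...and rewrite them in place
    return tokens

draft = [3, 7, MASK, 12, 5]
print(auto_edit(draft))
```

The key property this sketch mirrors is that drafting and editing share one set of weights: no auxiliary refinement network is ever called.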
The results on the NAVSIM benchmark are striking. With camera-only input, ReflectDrive-2 reaches a PDMS (Predictive Driver Model Score) of 91.0, rising to 94.8 in a best-of-6 oracle setting. Critically, the RL fine-tuning step is essential: with supervised pre-training alone, inference-time AutoEdit improves PDMS by only 0.3 points, whereas full-rollout RL increases that gain to 1.9 points. The system also co-designs an efficient reflective decoding stack, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking to deliver an average latency of just 31.8 ms on NVIDIA Thor hardware. This work demonstrates how RL-aligned self-editing can make discrete diffusion planners both accurate and fast for real-world driving.
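The best-of-6 oracle number can be read as a simple sample-and-select procedure: draw several stochastic rollouts and keep the one the metric scores highest. The sketch below illustrates that pattern only; `sample_trajectory` and `pdms_score` are hypothetical stand-ins, not the real planner or the real PDMS.

```python
import random

random.seed(0)

def sample_trajectory():
    """Stand-in for one stochastic rollout of the diffusion planner."""
    return [random.randint(0, 31) for _ in range(8)]   # 8 trajectory tokens

def pdms_score(traj):
    """Stand-in driving-quality score in [0, 100]; higher is better."""
    return 100.0 - 2.0 * len(set(traj))   # toy scorer, not the real PDMS

def best_of_n(n=6):
    """Draw n candidate trajectories and keep the highest-scoring one."""
    candidates = [sample_trajectory() for _ in range(n)]
    return max(candidates, key=pdms_score)

best = best_of_n(6)
print(pdms_score(best))
```

The gap between 91.0 (single rollout) and 94.8 (best of 6) is an upper bound on what a perfect candidate selector could recover; the oracle assumes access to the ground-truth metric at selection time.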
- ReflectDrive-2 uses masked discrete diffusion with AutoEdit to rewrite trajectory tokens without an auxiliary network, enabling in-place self-correction.
- Full-rollout RL fine-tuning boosts AutoEdit's PDMS gain from +0.3 (supervised only) to +1.9, showing the critical role of reinforcement learning.
- Achieves 91.0 PDMS (camera-only) and 94.8 PDMS (best-of-6) on NAVSIM with 31.8 ms latency on NVIDIA Thor.
Why It Matters
ReflectDrive-2 shows that RL-driven self-editing can make discrete diffusion planners both more accurate and fast enough for real-time autonomous driving.