Robotics

Streaming Intent Model SI Achieves Breakthrough in Autonomous Driving Control

New vision-language-action (VLA) model generates safe, diverse driving actions from reasoned intent, not predefined paths.

Deep Dive

A new paper formalizes action emergence as a target capability for autonomous driving: generating safe, semantically appropriate actions in arbitrary long-tail traffic scenes through reasoning rather than retrieval or interpolation. The authors argue that previous paradigms fall short: autoregressive decoders collapse multimodality, while diffusion and flow generators cannot be steered by intent. Their solution, Streaming Intent, streams driving intent along two axes: semantically, through a continuous chain-of-thought, and temporally, across clips, so that commitments stay coherent over the driving horizon. The resulting VLA model, SI, autoregressively decodes a four-step chain-of-thought to emit an intent token, then applies classifier-free guidance to a flow-matching action head that needs only two denoising steps.
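The last stage of that pipeline, intent-conditioned sampling with classifier-free guidance over two denoising steps, can be illustrated with a minimal sketch. This is not the authors' implementation: the velocity field below is a hand-written toy standing in for the learned action head, and every name (`intent_target`, `toy_velocity`, `sample_plan`), the three intent classes, the 8-waypoint horizon, and the guidance scale are assumptions made for illustration. The guidance rule is the standard classifier-free combination of conditional and unconditional predictions, assumed here to be what the paper applies.

```python
import numpy as np

HORIZON = 8        # hypothetical number of future waypoints
NULL_INTENT = -1   # stands in for the "no intent" (unconditional) input

# Hypothetical intent classes and their final lateral offsets in meters.
LATERAL_SHIFT = {0: 0.0, 1: 3.5, 2: -3.5, NULL_INTENT: 0.0}


def intent_target(intent):
    """Toy target trajectory of shape (HORIZON, 2): forward and lateral position."""
    forward = 2.0 * np.arange(1, HORIZON + 1)
    lateral = LATERAL_SHIFT[intent] * np.linspace(0.0, 1.0, HORIZON) ** 2
    return np.stack([forward, lateral], axis=1)


def toy_velocity(x, t, intent):
    """Stand-in for a learned flow-matching velocity field v(x, t | intent).

    For a straight-line path from noise to data, the velocity at time t is
    (target - x) / (1 - t). A real model would predict this with a network.
    """
    return (intent_target(intent) - x) / max(1.0 - t, 1e-3)


def sample_plan(intent, guidance_scale=1.0, num_steps=2, seed=0):
    """Integrate the guided velocity field from noise (t=0) to a plan (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((HORIZON, 2))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v_cond = toy_velocity(x, t, intent)
        v_uncond = toy_velocity(x, t, NULL_INTENT)
        # Classifier-free guidance: push the unconditional prediction
        # toward (and, for scale > 1, past) the intent-conditioned one.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v  # one Euler denoising step
    return x


if __name__ == "__main__":
    # Same scene (same noise seed), different intent class at inference.
    for intent in (0, 1, 2):
        plan = sample_plan(intent, guidance_scale=1.5)
        print(intent, np.round(plan[-1], 2))
```

In this toy setting, holding the noise seed fixed and changing only the intent class yields different trajectories, which mirrors the controllability the paper reports. The guidance scale controls how strongly the plan commits to the chosen intent relative to the unconditional prediction.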

On the Waymo End-to-End benchmark, SI achieves competitive performance, with a Rater Feedback Score (RFS) of 7.96 on the validation set and 7.74 on the test set. More importantly, the authors report that the model demonstrates intent-faithful controllability for the first time in a fully end-to-end VLA: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans. This behavior arises purely from data-driven learning, without any pre-built trajectory bank or hand-coded post-hoc selector. The work opens new directions for interpretable, intent-driven autonomous driving systems.

Key Points
  • SI decodes a four-step chain-of-thought into an intent token, which guides a flow-matching action head that needs just two denoising steps.
  • SI achieves an RFS of 7.96 on the Waymo End-to-End validation set and 7.74 on the test set, competitive in aggregate performance.
  • SI is reported as the first end-to-end VLA to demonstrate intent-faithful controllability: varying the intent class yields distinct, high-quality plans without a pre-built trajectory bank or hand-coded selector.

Why It Matters

Points toward safer, more interpretable autonomous driving, in which actions are steered by explicit, reasoned intent rather than retrieved or interpolated from predefined paths.