Efficient Equivariant Transformer for Self-Driving Agent Modeling
New transformer architecture achieves SE(2)-equivariance without expensive pairwise positional encodings, eliminating their quadratic cost.
A research team from Uber ATG and the University of Toronto has unveiled DriveGATr, a new transformer-based architecture designed to model the behavior of agents (like cars and pedestrians) in self-driving systems. The core challenge in this domain is building models that respect fundamental symmetries of the physical world, specifically SE(2)-equivariance—meaning predictions should not change if the entire scene is rotated or translated. While standard transformers handle permutation equivariance, achieving SE(2)-equivariance has traditionally required explicit, pairwise relative positional encodings, a method whose computational cost scales quadratically with the number of agents, severely limiting scalability for complex urban scenes.
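To see why pairwise relative encodings scale poorly, consider a toy sketch (our own illustration, not code from the paper): encoding each pair of agents materializes an N×N tensor of offsets, while a per-token geometric feature, such as the multivector encoding described below, stays linear in the number of agents.

```python
import numpy as np

def pairwise_relative_positions(xy):
    """xy: (N, 2) agent positions -> (N, N, 2) tensor of offsets x_j - x_i."""
    return xy[None, :, :] - xy[:, None, :]

N = 64
xy = np.random.randn(N, 2)

rel = pairwise_relative_positions(xy)   # O(N^2) memory per scene
assert rel.shape == (N, N, 2)

# A per-token geometric feature (here just homogeneous coordinates)
# needs only O(N) memory.
per_token = np.concatenate([xy, np.ones((N, 1))], axis=1)

print(rel.size, per_token.size)  # 8192 vs 192 floats at N = 64
```

At 64 agents the gap is already ~40×, and it widens linearly as scenes grow, which is the scalability bottleneck the article describes for dense urban traffic.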
DriveGATr's innovation lies in its use of geometric algebra to sidestep this bottleneck. Instead of standard vectors, the model encodes each scene element—vehicles, pedestrians, lanes—as a multivector within the 2D projective geometric algebra ℝ*_{2,0,1}. This mathematical representation inherently encapsulates geometric relationships like orientation and position. These multivectors are then processed through a stack of equivariant transformer blocks. Crucially, because the geometric relationships are baked into the multivectors themselves, the model can use standard, efficient attention mechanisms between them, completely avoiding the need for the expensive quadratic-cost positional encodings used by prior methods.
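The equivariance property can be illustrated with a minimal numerical sketch (our own toy, not the paper's implementation). In 2D projective geometric algebra, a point (x, y) embeds as the bivector P = x·e20 + y·e01 + e12, stored here as a coefficient array [e20, e01, e12]. Applying the rotor R = cos(θ/2) + sin(θ/2)·e12 via the sandwich product rotates the (e20, e01) coefficients exactly like the point itself and leaves e12 fixed, so embedding commutes with rotation:

```python
import numpy as np

def embed_point(x, y):
    # Coefficients of (e20, e01, e12) for the PGA point x*e20 + y*e01 + e12.
    return np.array([x, y, 1.0])

def rotate_point(x, y, theta):
    # Ordinary 2D rotation of the Euclidean point.
    c, s = np.cos(theta), np.sin(theta)
    return c * x - s * y, s * x + c * y

def rotate_multivector(P, theta):
    # Effect of the rotor sandwich R P ~R on the bivector coefficients:
    # (e20, e01) rotate like (x, y); e12 is invariant.
    c, s = np.cos(theta), np.sin(theta)
    x, y, w = P
    return np.array([c * x - s * y, s * x + c * y, w])

# Equivariance check: embed-then-rotate equals rotate-then-embed.
theta = 0.7
lhs = rotate_multivector(embed_point(2.0, -1.0), theta)
rhs = embed_point(*rotate_point(2.0, -1.0, theta))
assert np.allclose(lhs, rhs)
```

Because the group action on the multivector coefficients mirrors the action on the scene, layers built from such representations can be made equivariant by construction, with no per-pair bookkeeping.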
Evaluated on the challenging Waymo Open Motion Dataset, DriveGATr demonstrates performance comparable to the state-of-the-art in traffic simulation. More importantly, it establishes a superior Pareto frontier, meaning it offers a better trade-off between prediction accuracy and computational expense. This efficiency gain is critical for practical deployment, where self-driving systems must run complex prediction models in real-time, processing scenes with dozens of dynamic agents. The work, accepted at CVPR 2026, represents a significant step toward more scalable and computationally feasible prediction modules for autonomous vehicles.
- Eliminates quadratic-cost positional encodings by encoding scenes as geometric algebra multivectors (ℝ*_{2,0,1}).
- Achieves SE(2)-equivariance—predictions transform consistently when the scene is rotated or translated—crucial for robust real-world predictions.
- Demonstrates superior performance vs. compute trade-off on Waymo Open Motion Dataset, enabling scaling to larger scenes.
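Why can standard attention remain equivariant without pairwise positional encodings? A toy demonstration with plain 2D vectors (a simplification of the paper's multivector attention, using names of our own) shows the key fact: attention logits built from inner products are unchanged when every query and key is rotated, since (Rq)·(Rk) = q·k.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K):
    # Standard scaled dot-product attention logits -> weights.
    return softmax(Q @ K.T / np.sqrt(K.shape[1]))

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 2))
K = rng.normal(size=(5, 2))

theta = 1.2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Rotating all queries and keys leaves the attention weights unchanged.
W  = attention_weights(Q, K)
Wr = attention_weights(Q @ R.T, K @ R.T)
assert np.allclose(W, Wr)
```

Because the weights are invariant and the values transform equivariantly, the whole attention layer respects the symmetry without any extra per-pair encoding tensor.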
Why It Matters
Enables more scalable, real-time prediction for autonomous vehicles by drastically reducing computational overhead for modeling complex traffic scenes.