UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
New VLA model adds 3D depth perception to video-based robot training, boosting manipulation performance by 40%.
A research team from Georgia Tech and MIT has unveiled UniLACT, a novel Vision-Language-Action (VLA) model that addresses a critical limitation in training robots from video data. Current methods rely on latent action representations learned from unlabeled RGB videos, but these primarily capture appearance-based dynamics and lack explicit 3D geometric structure, hindering performance in precise manipulation tasks. UniLACT introduces depth-aware latent pretraining, enabling downstream robot policies to inherit stronger spatial priors essential for contact-rich interactions. This breakthrough allows robots to learn manipulation skills more effectively from passive video observation without costly manual action labeling.
The technical innovation centers on UniLARN, a unified latent action learning framework that constructs a shared embedding space for RGB and depth data while explicitly modeling their cross-modal interactions. This produces modality-specific and unified latent action representations that serve as pseudo-labels for pretraining the transformer-based UniLACT model. Extensive experiments demonstrate that UniLACT consistently outperforms RGB-only baselines across in-domain and out-of-domain pretraining regimes, as well as on both seen and unseen manipulation tasks. The model's ability to recover 3D structure from video represents a significant step toward more capable and generalizable robot learning systems that can acquire complex skills from diverse visual data sources.
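To make the recipe concrete, the sketch below shows one way a depth-aware latent action learner of this kind could be structured: encode RGB and depth frames, fuse them with cross-modal attention, and quantize each frame-to-frame transition into a discrete latent action. The module names (`FrameEncoder`, `UniLARNSketch`), dimensions, and the VQ-style codebook are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a depth-aware latent action learner in the spirit of UniLARN.
# All design choices below are assumptions made for illustration.
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Small CNN mapping an image (RGB: 3 channels, depth: 1 channel) to a feature vector."""
    def __init__(self, in_channels: int, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, dim),
        )

    def forward(self, x):
        return self.net(x)


class UniLARNSketch(nn.Module):
    """Learns discrete latent actions from (o_t, o_{t+1}) pairs of RGB and depth frames."""
    def __init__(self, dim: int = 256, num_latent_actions: int = 64):
        super().__init__()
        self.rgb_enc = FrameEncoder(3, dim)
        self.depth_enc = FrameEncoder(1, dim)
        # Cross-modal interaction: RGB features attend to depth features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Discrete vocabulary of latent actions shared across modalities.
        self.codebook = nn.Embedding(num_latent_actions, dim)
        self.to_action = nn.Linear(4 * dim, dim)  # (rgb_t, rgb_t1, depth_t, depth_t1) -> action query
        self.decoder = nn.Linear(2 * dim, dim)    # predicts next-frame features from (feat_t, action)

    def encode_pair(self, rgb, depth):
        f_rgb, f_depth = self.rgb_enc(rgb), self.depth_enc(depth)
        # Fuse modalities so the latent action carries geometric as well as appearance cues.
        fused, _ = self.cross_attn(f_rgb.unsqueeze(1), f_depth.unsqueeze(1), f_depth.unsqueeze(1))
        return f_rgb, f_depth, fused.squeeze(1)

    def forward(self, rgb_t, depth_t, rgb_t1, depth_t1):
        f_rgb_t, f_depth_t, fused_t = self.encode_pair(rgb_t, depth_t)
        f_rgb_t1, f_depth_t1, fused_t1 = self.encode_pair(rgb_t1, depth_t1)
        # Query a discrete latent action that explains the transition t -> t+1.
        query = self.to_action(torch.cat([f_rgb_t, f_rgb_t1, f_depth_t, f_depth_t1], dim=-1))
        dists = torch.cdist(query, self.codebook.weight)       # (batch, num_latent_actions)
        action_ids = dists.argmin(dim=-1)                      # pseudo-labels for VLA pretraining
        action_vec = self.codebook(action_ids)
        # Self-supervised objective: current features + latent action must predict t+1 features.
        pred_t1 = self.decoder(torch.cat([fused_t, action_vec], dim=-1))
        recon_loss = nn.functional.mse_loss(pred_t1, fused_t1.detach())
        return action_ids, recon_loss
```

In a full implementation, a straight-through estimator or EMA codebook update (as in VQ-VAE) would be needed so gradients reach the encoders through the quantization step; the sketch omits that for brevity. The `action_ids` it emits are the kind of pseudo-labels the article describes.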
- UniLACT incorporates depth perception into video-based robot learning, addressing the 3D geometric gap in RGB-only methods
- The UniLARN framework creates unified latent actions from RGB and depth data, serving as pseudo-labels for pretraining (see the pretraining sketch after this list)
- Outperforms existing baselines in both simulation and real-world manipulation tasks, especially for contact-rich scenarios
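The pseudo-label pretraining stage itself can be pictured as a standard supervised step: a transformer policy is trained to predict the discrete latent action id from an (observation, instruction) pair before any real robot action labels are used. The tokenizer, backbone, and loss setup below are illustrative assumptions, not UniLACT's actual recipe.

```python
# Hedged sketch of latent-action pseudo-label pretraining for a VLA transformer.
import torch
import torch.nn as nn


class LatentActionVLASketch(nn.Module):
    def __init__(self, vocab_size: int = 1000, num_latent_actions: int = 64, dim: int = 256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)         # toy instruction embedding
        self.image_proj = nn.Linear(512, dim)                   # assumes precomputed visual features
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.action_head = nn.Linear(dim, num_latent_actions)   # classifies the latent action id

    def forward(self, image_feats, instr_ids):
        # image_feats: (batch, num_patches, 512); instr_ids: (batch, seq_len)
        tokens = torch.cat([self.image_proj(image_feats), self.text_embed(instr_ids)], dim=1)
        h = self.backbone(tokens)
        return self.action_head(h.mean(dim=1))                  # logits over latent actions


# One pretraining step: the "labels" are latent action ids produced by the depth-aware
# latent action learner on unlabeled video, so no robot action annotations are needed.
model = LatentActionVLASketch()
image_feats = torch.randn(8, 16, 512)
instr_ids = torch.randint(0, 1000, (8, 12))
pseudo_labels = torch.randint(0, 64, (8,))   # stand-in for UniLARN outputs
logits = model(image_feats, instr_ids)
loss = nn.functional.cross_entropy(logits, pseudo_labels)
loss.backward()
```

After this stage, the latent action head is typically replaced or supplemented with a real action head and fine-tuned on a comparatively small set of labeled robot demonstrations.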
Why It Matters
Enables robots to learn complex manipulation skills from ordinary videos, reducing reliance on expensive labeled demonstration data.