Robotics

ELAN4D adds 4D future tracking to make VLA robot policies up to 30% more robust

arXiv cs.RO June 01, 2026

⚡ No external trackers needed: the training signal for future robot motion comes from forward kinematics on the robot's own proprioceptive states.

Deep Dive

ELAN4D, proposed by a team led by Zeyuan He and including Philip Torr, addresses a key limitation of current Vision-Language-Action (VLA) models: their reactive nature. Most VLA policies directly regress actions from current observations without modeling future dynamics, making them brittle under perturbations. ELAN4D introduces embodiment-centric 4D supervision by extracting future 3D displacement tracks of robot keypoints (joints and end-effector) solely from forward kinematics derived from proprioceptive states. This requires no external trackers or scene reconstruction, keeping preprocessing costs negligible. A plug-and-play auxiliary branch with a lightweight track decoder injects this predictive signal into the action expert during training, while gradient isolation preserves the pretrained vision-language backbone. At inference, the decoder is discarded, so the policy interface remains unchanged.
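As a rough illustration of how such tracks could be derived from proprioception alone, the sketch below runs forward kinematics over a recorded joint-angle sequence and returns per-keypoint 3D displacements over a short future horizon (3D positions plus time give the "4D" signal). It is a minimal sketch, not the authors' pipeline: the three-joint toy arm, the link lengths, and the names `forward_kinematics` and `future_displacement_tracks` are all assumptions made for this example.

```python
import math

# Hypothetical toy arm: one base-yaw joint followed by two pitch joints.
LINK_LENGTHS = (0.30, 0.25)  # metres; illustrative values, not from the paper


def forward_kinematics(q):
    """Map joint angles (yaw, pitch1, pitch2) to 3D keypoints:
    base joint, elbow joint, end-effector."""
    yaw, pitch1, pitch2 = q
    keypoints = [(0.0, 0.0, 0.0)]
    reach, height, pitch = 0.0, 0.0, 0.0
    for length, delta in zip(LINK_LENGTHS, (pitch1, pitch2)):
        pitch += delta                      # pitch angles accumulate along the chain
        reach += length * math.cos(pitch)   # horizontal distance from the base axis
        height += length * math.sin(pitch)  # height above the base
        keypoints.append((reach * math.cos(yaw), reach * math.sin(yaw), height))
    return keypoints


def future_displacement_tracks(joint_states, t, horizon):
    """Displacement of every keypoint from step t to steps t+1 .. t+horizon.
    Returns a nested list indexed as [future step][keypoint][xyz]."""
    current = forward_kinematics(joint_states[t])
    last = len(joint_states) - 1
    tracks = []
    for k in range(1, horizon + 1):
        # Clamp at the end of the episode so the horizon is always filled.
        future = forward_kinematics(joint_states[min(t + k, last)])
        tracks.append([
            tuple(f - c for f, c in zip(fut_pt, cur_pt))
            for fut_pt, cur_pt in zip(future, current)
        ])
    return tracks


# Recorded proprioception: the arm swings 90 degrees about its base.
states = [(0.0, 0.0, 0.0), (math.pi / 2, 0.0, 0.0)]
tracks = future_displacement_tracks(states, t=0, horizon=2)
print(tracks[0][-1])  # end-effector displacement one step ahead
```

Everything here is computed from joint angles that a robot already logs, which is why this kind of preprocessing needs no camera-based tracker or scene reconstruction. A real arm would swap the toy kinematics for its own kinematic model.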

Experiments on LIBERO, LIBERO-Plus, RoboTwin2.0, and real-world manipulation tasks show consistent improvements over strong VLA baselines. ELAN4D achieves the best overall performance among the compared methods, with the largest gains (up to 30%) under out-of-distribution perturbations such as camera angle changes, background shifts, and layout alterations. The method is model-agnostic and can be applied to various VLA architectures. The authors argue that explicitly modeling temporal dynamics through embodiment-centric supervision is a key step toward more robust and generalizable manipulation policies, especially for deployment in unstructured environments where visual conditions vary significantly.

Key Points

ELAN4D predicts future robot keypoint tracks (joints, end-effector) using only forward kinematics from proprioceptive states, requiring no external trackers or reconstruction.
A lightweight track decoder injects 4D spatio-temporal supervision into VLA action experts during training and is then discarded at inference, adding zero inference cost.
Consistent gains over VLA baselines on LIBERO, RoboTwin2.0, and real-world tasks, with up to 30% improvement under camera, background, and layout perturbations.
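The train-only track decoder and the gradient isolation described above can be pictured with a deliberately tiny scalar model and hand-written gradients. This is a sketch under stated assumptions, not the paper's architecture: each module is reduced to one hypothetical weight (`w_backbone`, `w_expert`, `w_decoder`), the track loss is assumed to be a squared error, and gradient isolation is assumed to mean that the track loss updates the action expert and the decoder but is stopped before the backbone.

```python
def track_loss_grads(x, w_backbone, w_expert, w_decoder, track_target,
                     isolate_backbone=True):
    """Gradients of a squared-error track loss for a scalar toy model."""
    feature = w_backbone * x          # pretrained vision-language feature
    hidden = w_expert * feature       # action-expert hidden state
    track_pred = w_decoder * hidden   # auxiliary track decoder (training only)

    d_pred = 2.0 * (track_pred - track_target)  # dLoss / dtrack_pred
    d_hidden = d_pred * w_decoder               # backprop through the decoder

    return {
        "decoder": d_pred * hidden,
        "expert": d_hidden * feature,
        # With isolation, the auxiliary gradient never reaches the backbone.
        "backbone": 0.0 if isolate_backbone else d_hidden * w_expert * x,
    }


def act(x, w_backbone, w_expert):
    """Inference path: the decoder weight does not appear at all."""
    return w_expert * (w_backbone * x)


grads = track_loss_grads(x=1.0, w_backbone=2.0, w_expert=0.5,
                         w_decoder=3.0, track_target=1.0)
print(grads)                   # the backbone entry is 0.0 under isolation
print(act(1.0, 2.0, 0.5))      # same interface with or without the decoder
```

Because `act` never touches the decoder weight, dropping the decoder after training leaves the deployed policy's inputs, outputs, and compute unchanged. In a real framework the isolation would typically be expressed with a stop-gradient or detach operation rather than by hand.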

Why It Matters

ELAN4D makes robot policies more robust to visual changes without extra hardware, a practical step toward deployment in unstructured environments.
