Uni-LaViRA navigates robots zero-shot across wheeled, quadruped, humanoid, and UAV platforms
A single architecture controls four robot types on four navigation tasks without any training.
Uni-LaViRA rethinks embodied navigation as a translation problem: language provides semantic directional commands and vision provides pixel-level targets, both handled natively by pretrained multimodal LLMs. Framing the problem this way removes the need for robot-specific training data. The same architecture covers four distinct robot platforms (wheeled, quadruped, humanoid, self-built UAV) and four task families (VLN-CE, ObjectNav, EQA, Aerial-VLN) with zero additional training. Two mechanisms make this practical. TODO List Memory (TDM) maintains a structured checklist of sub-goals and re-injects unfinished items into the agent’s attention window at each step. Second Chance Backtrack (SCB) rolls the robot back to a pre-error state, turning navigation into a self-correcting loop.
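To make the two mechanisms concrete, here is a minimal toy sketch of an agent loop that combines a checklist memory with rollback on error. It is not the authors' implementation: the names `TodoListMemory`, `navigate`, `policy`, `step_fn`, `looks_wrong`, and `goal_reached` are hypothetical, the "robot" is a single integer position on a line, and a hard-coded stub stands in for the multimodal LLM.

```python
from dataclasses import dataclass, field


@dataclass
class TodoListMemory:
    """Hypothetical stand-in for TDM: an ordered checklist of sub-goals."""
    items: list
    done: set = field(default_factory=set)

    def pending(self):
        return [item for item in self.items if item not in self.done]

    def render(self):
        # Unfinished items are re-rendered into the prompt at every step.
        lines = [f"- [ ] {item}" for item in self.pending()]
        return "Unfinished sub-goals:\n" + "\n".join(lines)


def navigate(todo, policy, step_fn, looks_wrong, goal_reached, pose, max_steps=10):
    """Toy agent loop; every callable here is an assumption for illustration."""
    trace = []
    rejected = set()  # actions already tried and rolled back at this pose
    for _ in range(max_steps):
        if not todo.pending():
            break
        snapshot = (pose, set(todo.done))  # pre-action state kept for rollback
        action = policy(todo.render(), pose, rejected)
        pose = step_fn(pose, action)  # commit the move
        if looks_wrong(pose):
            # Stand-in for SCB: restore the pre-error state and retry
            # with the failed action excluded.
            pose, todo.done = snapshot[0], set(snapshot[1])
            rejected.add(action)
            trace.append(("backtrack", action))
            continue
        rejected.clear()
        for item in todo.pending():
            if goal_reached(item, pose):
                todo.done.add(item)
        trace.append(("move", action))
    return pose, trace


def demo():
    places = {-1: "dead end", 1: "hallway", 2: "kitchen"}
    todo = TodoListMemory(["reach hallway", "reach kitchen"])
    prompts = []

    def policy(prompt, pose, rejected):
        # Stub for the multimodal LLM: it wrongly prefers "left" at the start.
        prompts.append(prompt)
        preferred = ["left", "right"] if pose == 0 else ["right", "left"]
        return next((a for a in preferred if a not in rejected), "right")

    def step_fn(pose, action):
        return pose + (1 if action == "right" else -1)

    def looks_wrong(pose):
        return places.get(pose) == "dead end"

    def goal_reached(item, pose):
        return item == f"reach {places.get(pose)}"

    pose, trace = navigate(todo, policy, step_fn, looks_wrong, goal_reached, pose=0)
    return pose, trace, prompts


if __name__ == "__main__":
    final_pose, trace, prompts = demo()
    print(final_pose, trace)
    print(prompts[-1])
```

In this toy run the stub policy first walks into the dead end, the loop restores the saved state and retries with that action excluded, and the checklist shown to the policy shrinks once the hallway sub-goal is ticked off. A real system would replace the stub with model calls and would need its own way of detecting errors, which the summary above does not specify.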
The reported benchmark results: 60.7% success rate (SR) on VLN-CE R2R, 51.3% on RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV. These numbers rival or surpass foundation models trained on millions of robot trajectories and thousands of GPU-hours. For practitioners, the implication is that an organization could deploy a single navigation controller across a heterogeneous robot fleet without collecting or labeling new data, a significant step toward plug-and-play robotics.
- Uni-LaViRA achieves zero-shot navigation across four robot types (wheeled, quadruped, humanoid, UAV) and four task families, with no robot-specific training data.
- Key benchmarks: 77.7% SR on HM3D-v2, 60.7% on VLN-CE R2R, and 40.0% on OpenUAV, rivaling or surpassing models trained on millions of robot trajectories.
- Two agent-loop mechanisms – TODO List Memory (TDM) for sub-goal tracking and Second Chance Backtrack (SCB) for error recovery – make unified zero-shot navigation practical.
Why It Matters
A single zero-shot controller could let heterogeneous robot fleets navigate new environments without costly training data or fine-tuning.