RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
Robots now understand natural language in 3D from a single camera feed
RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine) introduces a tightly coupled multi-modal fusion approach for open-vocabulary semantic SLAM in dynamic environments. The system operates directly on raw monocular RGB video streams, eliminating the need for calibrated cameras, depth sensors, or pre-initialized poses. It leverages agglomerative foundation models such as RADIO to extract vision and language embeddings, which are integrated with geometric scene information at initialization, during optimization, and through factor graph connections. Adaptive robust kernels wrap the optimization to handle actively moving objects and agent-displaced scene elements, such as furniture rearranged during ego-centric sessions.
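The paper does not spell out which kernel it uses, but adaptive robust weighting of this kind typically follows the iteratively reweighted least-squares (IRLS) pattern sketched below. This is a minimal sketch, not the authors' implementation: it assumes a Geman-McClure kernel whose scale is re-estimated each iteration from the residual distribution via the median absolute deviation, and `residuals` stands in for the factor-graph reprojection errors.

```python
import numpy as np

def geman_mcclure_weights(residuals: np.ndarray, scale: float) -> np.ndarray:
    """IRLS weights for the Geman-McClure kernel, w(r) = 1 / (1 + (r/c)^2)^2.

    Inliers (|r| << c) get weight ~1; large residuals, e.g. from actively
    moving objects, are smoothly down-weighted toward 0.
    """
    c2 = scale ** 2
    return (c2 / (residuals ** 2 + c2)) ** 2

def adaptive_scale(residuals: np.ndarray, k: float = 1.4826) -> float:
    """Re-estimate the kernel scale from the current residuals, using the
    median absolute deviation as a robust sigma estimate (k makes it
    consistent for Gaussian noise)."""
    return k * np.median(np.abs(residuals - np.median(residuals))) + 1e-9

# One IRLS step: residuals on dynamic objects receive tiny weights, so they
# barely influence the pose/structure update of the factor graph.
residuals = np.array([0.1, -0.2, 0.15, 5.0, -4.2])  # last two: moving object
w = geman_mcclure_weights(residuals, adaptive_scale(residuals))
```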
In experiments, RADIO-ViPE achieved state-of-the-art results on the dynamic sequences of the TUM RGB-D benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static-scene assumptions. This bridges a critical gap for real-world deployment in autonomous robotics and unconstrained in-the-wild video streams, enabling geometry-aware open-vocabulary grounding that associates arbitrary natural-language queries with localized 3D regions and objects.
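Geometry-aware grounding of this kind usually reduces to a similarity search between a text embedding and the language-aligned features fused into the map. The sketch below shows only that query step; `embed_text` and the per-point feature map are hypothetical stand-ins for the system's language head and fused representation, not the paper's actual interface.

```python
import numpy as np

def ground_query(text_emb, point_xyz, point_feats, thresh=0.6):
    """Return the 3D points whose fused embedding matches a language query.

    text_emb:    (D,)  embedding of the natural-language query
    point_xyz:   (N,3) reconstructed 3D points
    point_feats: (N,D) language-aligned features fused into the map
    """
    feats = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    sims = feats @ (text_emb / np.linalg.norm(text_emb))  # cosine similarity
    mask = sims > thresh  # keep points matching the open-vocabulary query
    return point_xyz[mask], sims[mask]

# e.g. localize "the red chair" in the reconstructed scene:
# xyz, scores = ground_query(embed_text("the red chair"), map_xyz, map_feats)
```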
- Operates on raw monocular RGB video with no depth sensors, camera calibration, or pose initialization required
- Tightly couples multi-modal embeddings from foundation models (e.g., RADIO) with geometric data via factor graph optimization (see the joint-factor sketch after this list)
- Achieves state of the art on the dynamic sequences of the TUM RGB-D benchmark, rivaling offline methods that require calibrated inputs
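Tight coupling here means the semantic and geometric terms share a single objective rather than being fused after the fact. A minimal sketch of what such a joint factor could look like, under the assumption that each landmark carries an embedding that must agree both geometrically (reprojection) and semantically (feature consistency) with every observation; the weighting `lam` and all names are illustrative, not the authors' formulation.

```python
import numpy as np

def joint_factor_residual(pose, landmark_xyz, landmark_emb,
                          obs_uv, obs_emb, project, lam=0.5):
    """Stacked residual for one observation of one landmark.

    Geometric part: reprojection error of the 3D landmark into the frame,
                    via a caller-supplied camera model `project`.
    Semantic part:  mismatch between the landmark's embedding and the
                    frame's feature at the observed pixel.
    Optimizing both in one solve is what 'tightly coupled' refers to.
    """
    r_geo = project(pose, landmark_xyz) - obs_uv  # (2,) pixel residual
    r_sem = lam * (landmark_emb - obs_emb)        # (D,) embedding residual
    return np.concatenate([r_geo, r_sem])         # handed to the solver
```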
Why It Matters
Enables robots to understand natural language in 3D using just a single camera, removing hardware barriers for real-world deployment.