RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments
Robots now understand natural language in 3D from a single camera feed
RADIO-ViPE (Reduce All Domains Into One -- Video Pose Engine) introduces a tightly coupled multi-modal fusion approach for open-vocabulary semantic SLAM in dynamic environments. The system operates directly on raw monocular RGB video streams, eliminating the need for calibrated cameras, depth sensors, or pre-initialized poses. It leverages agglomerative foundation models such as RADIO to extract vision and language embeddings, which are integrated with geometric scene information at initialization, during optimization, and through factor graph connections. Adaptive robust kernels wrap the optimization to handle actively moving objects and agent-displaced scene elements, such as furniture rearranged during ego-centric sessions.
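The paper does not spell out which kernel it uses, but adaptive robust weighting of this kind typically follows the iteratively reweighted least-squares (IRLS) pattern sketched below. This is a minimal sketch, not the authors' implementation: it assumes a Geman-McClure kernel whose scale is re-estimated each iteration from the residual distribution via the median absolute deviation, and `residuals` stands in for the factor-graph reprojection errors.

```python
import numpy as np

def geman_mcclure_weights(residuals: np.ndarray, scale: float) -> np.ndarray:
    """IRLS weights for the Geman-McClure kernel, w(r) = 1 / (1 + (r/c)^2)^2.

    Inliers (|r| << c) get weight ~1; large residuals, e.g. from actively
    moving objects, are smoothly down-weighted toward 0.
    """
    c2 = scale ** 2
    return (c2 / (residuals ** 2 + c2)) ** 2

def adaptive_scale(residuals: np.ndarray, k: float = 1.4826) -> float:
    """Re-estimate the kernel scale from the current residuals, using the
    median absolute deviation as a robust sigma estimate (k makes it
    consistent for Gaussian noise)."""
    return k * np.median(np.abs(residuals - np.median(residuals))) + 1e-9

# One IRLS step: residuals on dynamic objects receive tiny weights, so they
# barely influence the pose/structure update of the factor graph.
residuals = np.array([0.1, -0.2, 0.15, 5.0, -4.2])  # last two: moving object
w = geman_mcclure_weights(residuals, adaptive_scale(residuals))
```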
In experiments, RADIO-ViPE achieved state-of-the-art results on the dynamic sequences of the TUM RGB-D benchmark while remaining competitive with offline open-vocabulary methods that rely on calibrated data and static-scene assumptions. This bridges a critical gap for real-world deployment in autonomous robotics and unconstrained in-the-wild video streams, enabling geometry-aware open-vocabulary grounding that associates arbitrary natural-language queries with localized 3D regions and objects.
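Geometry-aware grounding of this kind usually reduces to a similarity search between a text embedding and the language-aligned features fused into the map. The sketch below shows only that query step; `embed_text` and the per-point feature map are hypothetical stand-ins for the system's language head and fused representation, not the paper's actual interface.

```python
import numpy as np

def ground_query(text_emb, point_xyz, point_feats, thresh=0.6):
    """Return the 3D points whose fused embedding matches a language query.

    text_emb:    (D,)  embedding of the natural-language query
    point_xyz:   (N,3) reconstructed 3D points
    point_feats: (N,D) language-aligned features fused into the map
    """
    feats = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    sims = feats @ (text_emb / np.linalg.norm(text_emb))  # cosine similarity
    mask = sims > thresh  # keep points matching the open-vocabulary query
    return point_xyz[mask], sims[mask]

# e.g. localize "the red chair" in the reconstructed scene:
# xyz, scores = ground_query(embed_text("the red chair"), map_xyz, map_feats)
```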
- Operates on raw monocular RGB video with no depth sensors, camera calibration, or pose initialization required
- Tightly couples multi-modal embeddings from foundation models (e.g., RADIO) with geometric data via factor graph optimization (see the joint-factor sketch after this list)
- Achieves state of the art on the dynamic sequences of the TUM RGB-D benchmark, rivaling offline methods that require calibrated inputs
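Tight coupling here means the semantic and geometric terms share a single objective rather than being fused after the fact. A minimal sketch of what such a joint factor could look like, under the assumption that each landmark carries an embedding that must agree both geometrically (reprojection) and semantically (feature consistency) with every observation; the weighting `lam` and all names are illustrative, not the authors' formulation.

```python
import numpy as np

def joint_factor_residual(pose, landmark_xyz, landmark_emb,
                          obs_uv, obs_emb, project, lam=0.5):
    """Stacked residual for one observation of one landmark.

    Geometric part: reprojection error of the 3D landmark into the frame,
                    via a caller-supplied camera model `project`.
    Semantic part:  mismatch between the landmark's embedding and the
                    frame's feature at the observed pixel.
    Optimizing both in one solve is what 'tightly coupled' refers to.
    """
    r_geo = project(pose, landmark_xyz) - obs_uv  # (2,) pixel residual
    r_sem = lam * (landmark_emb - obs_emb)        # (D,) embedding residual
    return np.concatenate([r_geo, r_sem])         # handed to the solver
```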
Why It Matters
Enables robots to understand natural language in 3D using just a single camera, removing hardware barriers for real-world deployment.