Static Scene Reconstruction from Dynamic Egocentric Videos
Researchers solve the 'ghost hand' problem that plagues AR/VR scene mapping from bodycams.
A new research paper tackles a major hurdle in computer vision: building accurate, static 3D maps from the chaotic, shaky footage of first-person, or 'egocentric', videos. Such footage, common from bodycams, AR glasses, and robots, is dominated by rapid camera motion and dynamic foreground objects like waving hands, which cause state-of-the-art systems like MapAnything to fail. The resulting failures, 'ghost' geometry left behind by moving objects and catastrophic trajectory drift, render the 3D reconstructions unusable.
The proposed pipeline bridges this gap with two key innovations. First, a mask-aware reconstruction mechanism explicitly suppresses dynamic elements within the model's attention layers, preventing artifacts from contaminating the static background map. Second, it employs a chunked reconstruction strategy with pose-graph stitching, processing long videos in segments and then combining them to ensure global consistency and eliminate drift. Tested on the HD-EPIC and indoor drone datasets, the method markedly reduces Absolute Trajectory Error (ATE) and produces visually clean geometry.
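To make the first idea concrete, below is a minimal sketch of mask-aware attention in PyTorch: attention logits toward tokens flagged as dynamic are set to negative infinity, so those tokens cannot contribute geometry to the static map. The function name `masked_cross_attention`, the shape conventions, and the source of the mask (an off-the-shelf hand/object segmenter) are illustrative assumptions, not the paper's actual interface.

```python
# Sketch only: assumes PyTorch and a per-token boolean mask derived from
# per-frame dynamic-object segmentation (e.g. hands). Not the paper's API.
import torch

def masked_cross_attention(q, k, v, dynamic_mask):
    """Attention in which keys flagged as dynamic are suppressed.

    q, k, v:      (B, N, D) query/key/value tokens from the backbone.
    dynamic_mask: (B, N) bool, True where a key token lies on a dynamic
                  region and should be ignored by the static reconstruction.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5              # (B, N, N)
    # Block dynamic key tokens so they cannot inject "ghost" geometry.
    # (A real implementation would also guard against fully masked rows.)
    logits = logits.masked_fill(dynamic_mask[:, None, :], float("-inf"))
    attn = torch.softmax(logits, dim=-1)
    return attn @ v
```

The design choice mirrors standard attention masking: rather than editing the input images, the dynamic content is removed at the token level, which keeps the rest of the foundation model untouched.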
This work effectively extends the capability of existing foundation models for 3D reconstruction into the challenging domain of long-form, dynamic first-person scenes. It represents a critical step forward for applications that rely on understanding persistent environments from inherently unstable, interactive viewpoints.
- Solves the 'ghost geometry' problem in egocentric 3D reconstruction by using a mask-aware mechanism to suppress dynamic foreground objects like hands.
- Prevents long-term trajectory drift, a failure mode for systems like MapAnything, via a chunked reconstruction strategy with pose-graph stitching (see the stitching sketch after this list).
- Demonstrated on HD-EPIC and indoor drone datasets, markedly reducing Absolute Trajectory Error (ATE) and yielding clean static maps from chaotic video.
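The chunked strategy can be illustrated with a toy stitcher. The snippet below is a sketch under simplifying assumptions (each chunk yields 4x4 camera-to-world poses, consecutive chunks share `overlap` frames, and a single shared frame serves as the alignment anchor); the paper's pose-graph stitching would instead jointly optimize relative-pose constraints across all chunk boundaries.

```python
# Sketch only: naive chunk-to-chunk rigid alignment, not the paper's
# pose-graph optimizer. Poses are 4x4 camera-to-world matrices.
import numpy as np

def stitch_chunks(chunks, overlap=1):
    """Concatenate per-chunk trajectories into one global trajectory.

    chunks: list of (T_i, 4, 4) arrays, each expressed in its own
            local coordinate frame; consecutive chunks share `overlap` frames.
    """
    global_poses = [chunks[0]]
    for curr in chunks[1:]:
        # First frame of `curr` is the same frame as the start of the
        # previous chunk's overlapping tail, already in global coordinates.
        anchor_global = global_poses[-1][-overlap]
        anchor_local = curr[0]
        # Rigid transform mapping the new chunk into the global frame.
        T_align = anchor_global @ np.linalg.inv(anchor_local)
        global_poses.append(T_align @ curr[overlap:])   # drop duplicated frames
    return np.concatenate(global_poses, axis=0)
```

In a full pipeline, this per-boundary alignment would be replaced by a pose-graph solver that distributes residual error over all boundary constraints, which is what suppresses drift over long sequences.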
Why It Matters
Enables robust environment mapping for AR glasses, robotics, and digital twins using real-world, messy first-person video data.