UniCon3R: Contact-aware 3D Human-Scene Reconstruction from Monocular Video
New model fixes floating bodies and scene penetration in real-time 4D reconstruction from monocular video.
A research team from the Max Planck Institute for Intelligent Systems (MPI-IS) and the National University of Singapore (NUS) has unveiled UniCon3R (Unified Contact-aware 3D Reconstruction), a framework for creating physically plausible 3D reconstructions of humans interacting with their environment from ordinary monocular video. The system operates in real-time, performing online 4D reconstruction (3D over time) to jointly recover high-fidelity scene geometry and spatially aligned 3D human poses. Its core advance is that it does not treat human-scene contact as a mere auxiliary prediction: inferred 3D contact is fed back as a corrective signal that refines the final human pose estimate.
This contact-as-correction paradigm directly addresses a major flaw in prior feed-forward methods, which often produced artifacts like bodies floating above the ground or unnaturally penetrating furniture and walls. By modeling physical interaction as an internal prior, UniCon3R ensures the reconstructed humans are grounded in the scene. Extensive experiments on standard benchmarks—including RICH, EMDB, 3DPW, and SLOPER4D—demonstrate that UniCon3R outperforms state-of-the-art baselines in both physical plausibility and the accuracy of global human motion estimation.
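To make the contact-as-correction idea concrete, here is a minimal sketch of what such a corrective objective could look like. This is an illustration only, not UniCon3R's actual formulation: the function name, the use of a signed distance field, and the two penalty terms are all assumptions made for clarity.

```python
import numpy as np

def contact_correction_loss(sdf, contact_prob):
    """Hypothetical contact-as-correction objective (illustrative sketch).

    sdf          : (V,) signed distance from each body vertex to the scene
                   surface; negative values mean the vertex penetrates geometry.
    contact_prob : (V,) predicted probability that each vertex is in contact.
    """
    # Penetration term: vertices inside the scene (sdf < 0) are pushed out,
    # discouraging bodies that sink into furniture, walls, or the floor.
    penetration = np.sum(np.clip(-sdf, 0.0, None) ** 2)

    # Attraction term: vertices the model believes are in contact are pulled
    # onto the surface, discouraging bodies that float above the ground.
    attraction = np.sum(contact_prob * np.clip(sdf, 0.0, None) ** 2)

    return penetration + attraction

# Toy example with three vertices (distances in metres):
# vertex 0 should touch the ground but hovers 5 cm above it,
# vertex 1 penetrates the scene by 2 cm, vertex 2 is free space.
sdf = np.array([0.05, -0.02, 0.30])
contact = np.array([1.0, 0.0, 0.0])
loss = contact_correction_loss(sdf, contact)
```

Minimizing a term like this with respect to the pose parameters would simultaneously lift penetrating vertices out of the scene and pull hovering contact vertices down onto it, which is the intuition behind using contact as an internal prior rather than a post-hoc metric.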
The work establishes a new paradigm for physically grounded joint reconstruction, where understanding interaction is central to the model's function, not just a post-hoc evaluation. This has significant implications for applications requiring digital humans that behave realistically within virtual or augmented environments, from film and game development to robotics simulation and AR/VR experiences.
- Uses 3D human-scene contact as an active corrective cue, not just a passive output, to eliminate floating/penetration artifacts.
- Achieves state-of-the-art results on physical plausibility and motion estimation across RICH, 3DPW, SLOPER4D, and EMDB benchmarks.
- Performs real-time, online 4D reconstruction from a single video feed, jointly outputting scene geometry and aligned 3D humans.
Why It Matters
Enables realistic digital human creation for film, gaming, and the metaverse directly from video, crucial for immersive experiences and simulation.