MoViD: View-Invariant 3D Human Pose Estimation via Motion-View Disentanglement
New framework disentangles viewpoint from motion, achieving 15 FPS on NVIDIA edge devices with 60% less training data.
A research team led by Yejia Liu has introduced MoViD, a framework for 3D human pose estimation that tackles the long-standing challenge of viewpoint variation. Traditional methods degrade when cameras capture people from unseen angles, require massive training datasets, and suffer from high latency. MoViD's core innovation is motion-view disentanglement: it separates viewpoint information from actual human motion features using an orthogonal projection module and physics-grounded contrastive alignment. This design allows the system to maintain accuracy even under severe occlusions and from completely novel camera angles.
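The paper's reference code is not reproduced here, so the following is only a minimal sketch of what an orthogonal projection step could look like: the motion feature is projected onto the view feature and that component is subtracted, leaving a motion representation orthogonal to the viewpoint direction. All function names, tensor shapes, and the `eps` stabilizer are illustrative assumptions, not details taken from the paper.

```python
import torch

def orthogonalize_motion(motion_feat, view_feat, eps=1e-8):
    """Remove the view-aligned component from motion features (sketch).

    Projects `motion_feat` onto `view_feat` and subtracts that component,
    so the returned motion representation is orthogonal to the viewpoint
    direction. Shapes (..., dim) and names are illustrative assumptions.
    """
    # Projection coefficient: <m, v> / (<v, v> + eps)
    coeff = (motion_feat * view_feat).sum(dim=-1, keepdim=True) / (
        (view_feat * view_feat).sum(dim=-1, keepdim=True) + eps
    )
    return motion_feat - coeff * view_feat
```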
For practical deployment, MoViD implements a frame-by-frame inference pipeline with a view-aware strategy that adaptively activates flip refinement based on the estimated viewpoint. Evaluated on nine public datasets plus newly collected multiview UAV and gait-analysis datasets, the system reports 24.2% lower pose estimation error than current state-of-the-art methods, robust performance with 60% less training data, and real-time 15 FPS operation on NVIDIA edge computing platforms. A specialized view estimator models key joint relationships to predict viewpoint information, enabling the system to generalize to real-world scenarios where camera positions are unpredictable.
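To make the view-aware strategy concrete, here is a hedged sketch of how such adaptive flip refinement might be wired: a hypothetical `view_estimator` scores how far the estimated viewpoint deviates from familiar views, and flip refinement (re-running on a mirrored frame and averaging) activates only when that score crosses a threshold. The interfaces, threshold value, and averaging scheme are assumptions for illustration; the paper's actual pipeline may differ.

```python
import torch

@torch.no_grad()
def infer_frame(model, frame, view_estimator, flip_threshold=0.6):
    """Frame-by-frame inference with view-aware flip refinement (sketch).

    Hypothetical interfaces: `model(frame)` returns a (J, 3) pose and
    `view_estimator(frame)` returns a score in [0, 1] for how unusual
    the estimated viewpoint is. `flip_threshold` is an illustrative
    knob, not a published value.
    """
    pose = model(frame)
    if view_estimator(frame) > flip_threshold:
        # Flip refinement: re-run on the horizontally mirrored frame,
        # mirror the prediction back, and average the two estimates.
        # (A full implementation would also swap left/right joint indices.)
        pose_flipped = model(torch.flip(frame, dims=[-1]))
        pose_flipped[..., 0] = -pose_flipped[..., 0]  # un-mirror the x-axis
        pose = 0.5 * (pose + pose_flipped)
    return pose
```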
The technology's edge-deployment capability is a significant step for applications that require real-time processing without cloud dependency. From healthcare monitoring systems that track patient movements to human-robot collaboration in manufacturing and immersive gaming, MoViD's combination of accuracy, efficiency, and viewpoint invariance addresses critical barriers to practical adoption. The research paper, submitted to arXiv with identifier 2604.03299, argues that disentangling viewpoint from motion features yields a representation of human pose that transcends any specific camera configuration.
- Reduces 3D pose estimation error by 24.2% compared to state-of-the-art methods through motion-view disentanglement
- Achieves real-time 15 FPS performance on NVIDIA edge devices while requiring 60% less training data
- Maintains accuracy under severe occlusions and novel camera viewpoints using physics-grounded contrastive alignment (a minimal sketch of the contrastive idea follows this list)
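The cross-view contrastive idea can be illustrated with a standard InfoNCE-style loss that treats the same motion observed from two viewpoints as a positive pair and other motions in the batch as negatives. This is a generic sketch under that assumption; the physics-grounded constraints described in the paper are not reproduced here, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(feats_view_a, feats_view_b, temperature=0.1):
    """InfoNCE-style cross-view alignment (sketch, not the paper's exact loss).

    Pulls motion embeddings of the same motion seen from two synchronized
    views together while pushing apart embeddings of different motions,
    encouraging view-invariant motion features.
    feats_*: (batch, dim) motion embeddings, one row per motion clip.
    """
    a = F.normalize(feats_view_a, dim=-1)
    b = F.normalize(feats_view_b, dim=-1)
    logits = a @ b.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: each view's feature must identify its counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```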
Why It Matters
Enables accurate real-time human motion tracking for healthcare, robotics, and AR/VR without expensive cloud infrastructure or extensive training data.