Research & Papers

ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions

New CVPR 2026 method taps foundation models to reconstruct 3D+time (4D) hand-object interactions from monocular footage.

Deep Dive

A research team led by Zikai Wang has developed ArtHOI, a computer vision framework that tackles the previously unexplored challenge of reconstructing 4D interactions between human hands and articulated objects from a single monocular video. Unlike existing methods that are limited to rigid objects or require multi-view setups, ArtHOI leverages recent foundation models through an optimization-based approach that refines their often-inaccurate priors. The system's key innovation is its ability to ground normalized object meshes in real-world metric space and align hand-object interactions using physical constraints.
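To make the metric-grounding idea concrete: a reconstructed mesh typically lives in a normalized coordinate frame, and one simple way to place it in metric space is to fit a scale and depth offset against per-vertex metric depth estimates from a monocular depth model. This is an illustrative sketch only, not the authors' method; all names and numbers are hypothetical.

```python
# Hedged sketch: ground a normalized mesh in metric space by solving
#   min_{s,t} || s * mesh_depths + t - observed_depths ||^2
# in closed form via ordinary least squares. ArtHOI's actual grounding
# optimization is more involved than this toy linear fit.
import numpy as np

def fit_metric_scale(mesh_depths, observed_depths):
    """Least-squares fit of scale s and offset t mapping normalized to metric depth."""
    A = np.stack([mesh_depths, np.ones_like(mesh_depths)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, observed_depths, rcond=None)
    return s, t

# Toy example: a mesh normalized to unit depth, observed at 0.3 m scale + 0.5 m offset.
mesh_z = np.linspace(0.0, 1.0, 50)
obs_z = 0.3 * mesh_z + 0.5
s, t = fit_metric_scale(mesh_z, obs_z)
```

In practice the observed depths would come from a foundation depth estimator sampled at the mesh's projected vertex locations, and the fit would be made robust to outliers.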

ArtHOI introduces two core technical contributions: Adaptive Sampling Refinement (ASR) for optimizing object scale and pose, and a Multimodal Large Language Model (MLLM) guided method for hand-object alignment that uses contact reasoning as optimization constraints. To validate their approach, the researchers created two new benchmark datasets—ArtHOI-RGBD for controlled settings and ArtHOI-Wild for real-world scenarios—demonstrating robustness across diverse objects and interaction types. The work, accepted to CVPR 2026, represents a significant advance in understanding complex manipulations where both hand and object parts move, such as opening laptop lids or operating scissors.
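The paper does not publish ASR's internals in this summary, but a coarse-to-fine sampling loop captures the general spirit: sample candidate values, score each against a reconstruction loss, then shrink the search interval around the winner. The sketch below fits a single scale parameter against a hypothetical `loss_fn`; the real ASR jointly refines object scale and pose.

```python
# Illustrative sketch (not ArtHOI's implementation) of a sample-then-refine loop.
import numpy as np

def adaptive_sampling_refinement(loss_fn, lo, hi, n_samples=9, n_rounds=4):
    """Repeatedly sample candidates, keep the best, and shrink the interval."""
    best = lo
    for _ in range(n_rounds):
        candidates = np.linspace(lo, hi, n_samples)
        best = min(candidates, key=loss_fn)        # lowest-loss candidate
        half_width = (hi - lo) / n_samples         # narrow the search window
        lo, hi = best - half_width, best + half_width
    return best

# Toy quadratic loss with a minimum at scale 0.7, standing in for a
# rendering/depth discrepancy measured against the input video.
scale = adaptive_sampling_refinement(lambda s: (s - 0.7) ** 2, 0.0, 2.0)
```

Each round narrows the interval by roughly the sample spacing, so a handful of rounds suffices even when the loss landscape is evaluated by expensive rendering.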

The framework's practical implications are substantial for fields requiring detailed interaction analysis. By eliminating the need for pre-scanned object models or specialized multi-camera setups, ArtHOI makes high-fidelity 4D reconstruction accessible from ordinary video footage. This opens possibilities for applications in robotics training, AR/VR content creation, and human-computer interaction research where understanding precise manipulation dynamics is crucial.

Key Points
  • Reconstructs 4D (3D+time) hand-articulated-object interactions from single RGB videos, eliminating the need for multi-view setups or pre-scanned models
  • Integrates priors from multiple foundation models with novel optimization methods including Adaptive Sampling Refinement and MLLM-guided alignment
  • Validated on new datasets ArtHOI-RGBD and ArtHOI-Wild, accepted to CVPR 2026 with code and project page available
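The MLLM-guided alignment in the second bullet can be pictured with a toy example. Suppose the MLLM's contact reasoning has already paired fingertips with object contact points (e.g. "the thumb presses the lid"); the hand translation that minimizes the summed squared contact distances then has a closed form. Everything below is a hedged sketch, not ArtHOI's code.

```python
# Hypothetical contact-constraint penalty: find the hand translation d that
# minimizes sum_i ||(p_i + d) - c_i||^2 over MLLM-paired fingertip/contact
# points. The minimizer is simply the mean offset between the pairs.
import numpy as np

def align_hand_to_contacts(fingertips, contacts):
    """Closed-form minimizer of the summed squared contact distances."""
    return (contacts - fingertips).mean(axis=0)

fingertips = np.array([[0.00, 0.00, 0.00],
                       [0.02, 0.01, 0.00]])   # toy fingertip positions (m)
contacts   = np.array([[0.10, 0.00, 0.05],
                       [0.12, 0.01, 0.05]])   # toy contact points on the object
d = align_hand_to_contacts(fingertips, contacts)  # hand translation
```

In a full system this penalty would be one term among others (pose priors, temporal smoothness, non-penetration) rather than solved in isolation.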

Why It Matters

Enables detailed digital reconstruction of complex manipulations from ordinary video, advancing robotics, AR/VR, and interaction analysis.