4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video
New CVPR 2026 paper creates animatable 3D horse avatars from monocular video, recovering appearance from as little as a single image, using only synthetic training data.
A research team from Tsinghua University and collaborating institutions has introduced 4DEquine, a groundbreaking computer vision framework for reconstructing detailed 3D models of horses from ordinary video footage. Accepted to CVPR 2026, the system addresses a critical bottleneck in animal reconstruction by separating the complex 4D (3D + time) problem into two specialized tasks: dynamic motion tracking and static appearance modeling. This architectural choice directly tackles the computational inefficiency and sensitivity to incomplete data that plagued previous joint-optimization methods.
For motion reconstruction, 4DEquine employs a spatio-temporal transformer with a post-optimization stage to generate smooth, pixel-aligned sequences of pose and shape from video. For appearance, its novel feed-forward network can create a high-fidelity, animatable 3D Gaussian avatar from as few as a single input image. To train these components without vast amounts of hard-to-capture real-world data, the team created two large-scale synthetic datasets: VarenPoser for high-quality surface motions and diverse camera trajectories, and VarenTex for realistic multi-view imagery generated via multi-view diffusion models.
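The disentangled design described above can be illustrated with a toy sketch. All names, shapes, and numbers below are hypothetical stand-ins, not the paper's actual API: a motion module maps video frames to per-frame pose parameters, a separate appearance module maps one image to a static canonical set of 3D Gaussians, and animation simply reposes that avatar frame by frame.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gaussian:
    center: tuple   # 3D position in the canonical (rest) pose
    scale: float    # isotropic size, illustrative only
    color: tuple    # RGB

def motion_model(video_frames: List[object]) -> List[List[float]]:
    """Stand-in for the spatio-temporal transformer: emits one
    pose vector per frame (here a dummy 3-DoF translation)."""
    return [[0.0, 0.0, 0.1 * t] for t, _ in enumerate(video_frames)]

def appearance_model(image: object) -> List[Gaussian]:
    """Stand-in for the feed-forward appearance network: builds a
    static canonical avatar from a single input image."""
    return [Gaussian(center=(0.0, 0.0, float(i)), scale=0.1,
                     color=(0.5, 0.4, 0.3)) for i in range(4)]

def pose_avatar(avatar: List[Gaussian], pose: List[float]) -> List[Gaussian]:
    """Apply a pose to the canonical Gaussians (toy rigid shift)."""
    return [Gaussian(center=(g.center[0] + pose[0],
                             g.center[1] + pose[1],
                             g.center[2] + pose[2]),
                     scale=g.scale, color=g.color) for g in avatar]

# Disentanglement in action: motion comes from the whole video,
# appearance from a single frame; the avatar is built once, then
# reposed independently for every frame.
frames = ["frame0", "frame1", "frame2"]
poses = motion_model(frames)
avatar = appearance_model(frames[0])
animation = [pose_avatar(avatar, p) for p in poses]
print(len(animation), len(animation[0]))  # one posed avatar per frame
```

The key property this sketch captures is that the expensive appearance reconstruction runs once, while the lightweight reposing runs per frame, which is the efficiency argument the paper makes against joint 4D optimization.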
Remarkably, despite being trained exclusively on this synthetic data, 4DEquine achieves state-of-the-art performance on established real-world benchmarks like APT36K and AiM. This demonstrates not only the robustness of the model's architecture but also the quality and transferability of the new synthetic datasets. The work represents a significant step toward non-invasive, video-based monitoring tools for animal health and biomechanics.
- Architecture disentangles 4D reconstruction into separate motion (spatio-temporal transformer) and appearance (feed-forward network) models, improving efficiency.
- Appearance network can generate an animatable 3D Gaussian avatar from a single image, dramatically reducing data requirements.
- Trained on novel synthetic datasets VarenPoser & VarenTex, yet sets new benchmarks on real-world APT36K and AiM datasets for geometry and appearance.
Why It Matters
Enables non-invasive, video-based health and gait analysis for horses, advancing veterinary science and animal welfare with accessible technology.