AVI-HT fuses vision and IMU for 3D hand tracking, cuts error by up to 24%

arXiv cs.CV May 22, 2026

⚡ New cross-sensor attention model tracks hands through occlusion during object interaction and introduces a dataset of 100K+ samples.

Deep Dive

AVI-HT, developed by Ziyi Kou, Ankit Kumar, Mia Huang, and colleagues, tackles one of the hardest problems in computer vision: tracking 3D hand pose while the hand manipulates an object and is partially occluded. The system fuses two complementary modalities: egocentric video from a head-mounted camera and 6-DoF inertial measurement unit (IMU) data from sensors worn on each finger of a glove. Its key innovation is a cross-sensor deep attention mechanism that learns how much to trust the visual cues versus each individual IMU signal, depending on occlusion and motion context. This adaptive fusion lets AVI-HT keep tracking accurately even when an object hides the hand from the camera.
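To make the idea of adaptive trust concrete, here is a minimal sketch of attention-weighted fusion over one vision token and five per-finger IMU tokens. It is not the authors' architecture: the function names, the 4-dimensional feature vectors, and the query are all hypothetical, and in a trained model the query and tokens would come from learned projections rather than hand-picked numbers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_tokens(query, tokens):
    """Scaled dot-product attention of one query over a list of sensor tokens.

    Returns (weights, fused): the weights sum to 1 and act as per-sensor
    trust, and fused is the weight-averaged token. Hypothetical helper,
    not an API from the paper.
    """
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    weights = softmax(scores)
    fused = [
        sum(w * tok[i] for w, tok in zip(weights, tokens))
        for i in range(dim)
    ]
    return weights, fused


# Made-up features: the vision token is weak, as if the hand were occluded.
vision = [0.1, 0.0, 0.1, 0.0]
imu_tokens = [
    [1.0, 0.5, 0.0, 0.2],  # thumb
    [0.8, 0.6, 0.1, 0.0],  # index
    [0.9, 0.4, 0.0, 0.1],  # middle
    [0.7, 0.7, 0.2, 0.0],  # ring
    [0.6, 0.5, 0.1, 0.3],  # little
]
query = [1.0, 1.0, 0.0, 0.0]  # hypothetical context query

weights, fused = fuse_tokens(query, [vision] + imu_tokens)
print("trust per sensor:", [round(w, 3) for w in weights])
print("fused feature:", [round(v, 3) for v in fused])
```

In this toy input the vision token scores lowest against the query, so it receives the smallest weight and the fused feature leans on the IMU tokens, which mirrors the behavior the article describes under occlusion.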

The authors also built the DexGloveHOI dataset, which contains over 100,000 synchronized vision-IMU samples with ground-truth 3D poses from a motion-capture system and covers everyday tasks such as grasping cups, turning keys, and typing. Compared with single-modal baselines (vision only or IMU only) and multi-modal fusion models without adaptive attention, AVI-HT reduced mean keypoint error by 16.1%, and by 24.2% for the wrist-aligned variant of the metric. Ablation studies examined each finger's IMU contribution and the model's sensitivity to sensor noise and temporal misalignment, showing that the approach is robust but still has limits. The work opens the door to more reliable hand tracking for VR/AR, robotic teleoperation, and sign language interpretation.
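The two error figures above are easier to read with the metrics written out. The sketch below assumes the common definitions: mean keypoint error as the average Euclidean distance between predicted and ground-truth 3D keypoints, and the wrist-aligned variant as the same error after translating both poses so their wrists coincide. The article does not spell out the paper's exact definitions, so treat the function names, the choice of keypoint 0 as the wrist, and the toy coordinates as assumptions.

```python
import math


def mean_keypoint_error(pred, truth):
    """Average Euclidean distance between matching 3D keypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def wrist_aligned(points, wrist_index=0):
    """Translate a pose so that its wrist keypoint sits at the origin."""
    wx, wy, wz = points[wrist_index]
    return [(x - wx, y - wy, z - wz) for x, y, z in points]


def wrist_aligned_error(pred, truth, wrist_index=0):
    """Mean keypoint error after removing the global wrist offset."""
    return mean_keypoint_error(
        wrist_aligned(pred, wrist_index),
        wrist_aligned(truth, wrist_index),
    )


def relative_reduction(baseline, improved):
    """Fractional error reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Toy pose: the prediction is the ground truth shifted by (3, 4, 0).
truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 10.0), (0.0, 10.0, 0.0)]
pred = [(x + 3.0, y + 4.0, z) for x, y, z in truth]

raw_error = mean_keypoint_error(pred, truth)      # 5.0: every point is off by 5
aligned_error = wrist_aligned_error(pred, truth)  # 0.0: the offset is purely global
print(raw_error, aligned_error)
```

The toy pose shows why the two numbers can differ: a prediction with the right hand shape but the wrong global position is penalized by the raw metric and not by the wrist-aligned one. A reported reduction such as 16.1% corresponds to `relative_reduction(baseline, improved)` returning about 0.161.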

Key Points

AVI-HT fuses egocentric video with on-glove 6-DoF IMU signals using a cross-sensor deep attention mechanism.
Reduces mean keypoint error by 16.1% (24.2% wrist-aligned) relative to existing single- and multi-modal methods.
Introduces the DexGloveHOI dataset with 100K+ synchronized vision-IMU samples for hand-object interaction tasks.

Why It Matters

Reliable hand tracking under occlusion enables better VR/AR interaction, robotic teleoperation, and sign language translation.
