Image & Video

Hand trajectory fusion boosts egocentric video grounding by up to +4.32 R1@IoU=0.3

New method uses hand skeletons to understand what you're doing in first-person video.

Deep Dive

Egocentric Natural Language Query (NLQ) grounding requires a model to find the exact moment in a long first-person video that answers a free-form text query. Existing methods fuse video appearance features with the query but ignore the information carried by hand motion. This is a significant gap: analysis of the Ego4D dataset shows that roughly 41% of NLQ queries are answered at, or immediately after, a moment of hand-object manipulation. Hand movements contain vital cues about what the user is doing, grasping, or assessing.

The authors introduce a hand-trajectory encoder that takes a sequence of hand skeleton positions and outputs semantically rich kinematic features. These features are aligned and fused with pretrained video-and-text features through cross-attention with adaptive gating, which lets the model dynamically weigh hand motion against appearance. On the Ego4D NLQ v2 validation set, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3). The work has been accepted as a poster at the EgoVis Workshop, held in conjunction with CVPR 2026, and points to a new direction for grounding first-person video.
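
To make the fusion step concrete, here is a minimal NumPy sketch of cross-attention followed by a sigmoid gate. It is an illustration under assumptions, not the authors' implementation: the function names, the parameter names (`Wq`, `Wk`, `Wv`, `Wg`, `bg`), the single attention head, the feature sizes, and the exact form of the gate are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, rng):
    """Random weights for the sketch (hypothetical names, untrained)."""
    s = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(0.0, s, (d, d)),      # query projection (video side)
        "Wk": rng.normal(0.0, s, (d, d)),      # key projection (hand side)
        "Wv": rng.normal(0.0, s, (d, d)),      # value projection (hand side)
        "Wg": rng.normal(0.0, s, (2 * d, d)),  # gate weights
        "bg": np.zeros(d),                     # gate bias
    }


def gated_cross_attention(video, hand, params):
    """Fuse hand-motion features into video features.

    video: (T, d) appearance features, one row per video clip
    hand:  (T_h, d) kinematic features from a hand-trajectory encoder
    Returns the fused (T, d) features and the (T, d) gate values.
    """
    q = video @ params["Wq"]
    k = hand @ params["Wk"]
    v = hand @ params["Wv"]

    # Each video step attends over all hand-trajectory steps.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T, T_h)
    hand_ctx = attn @ v                                      # (T, d)

    # Assumed gate: a sigmoid over both streams, so the mix can
    # change per time step and per feature channel.
    gate_in = np.concatenate([video, hand_ctx], axis=-1)     # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ params["Wg"] + params["bg"])))

    # gate near 1 favors hand motion, gate near 0 favors appearance.
    fused = gate * hand_ctx + (1.0 - gate) * video
    return fused, gate


rng = np.random.default_rng(0)
d = 16
video = rng.normal(size=(8, d))   # 8 video steps
hand = rng.normal(size=(12, d))   # 12 hand-trajectory steps
params = init_params(d, rng)
fused, gate = gated_cross_attention(video, hand, params)
print(fused.shape, gate.shape)
```

The gate is what makes the weighting adaptive in this sketch: with a strongly negative gate bias the output falls back to the appearance features alone, so the hand stream can only contribute where the gate opens.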

Key Points
  • Hand trajectory encoder converts hand skeletons into kinematic features for NLQ grounding
  • Gains of +2.54 on Hand-Object Interaction and +4.32 on Quantity/State queries (R1@IoU=0.3)
  • Accepted as a poster at the EgoVis Workshop at CVPR 2026; roughly 41% of Ego4D NLQ queries are answered at or right after hand-object manipulation
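
For readers unfamiliar with the metric, R1@IoU=0.3 can be sketched as follows: a query counts as a hit when the model's top-ranked time window overlaps the ground-truth window with a temporal IoU of at least 0.3, and the score is the percentage of hits. The function names and example windows below are hypothetical, and the sketch assumes an inclusive threshold; the official Ego4D evaluation code may differ in details.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) windows in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.3):
    """Percentage of queries whose top-1 window reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)


# Made-up windows for three queries (not results from the paper).
preds = [(10.0, 20.0), (0.0, 5.0), (30.0, 40.0)]
gts = [(12.0, 22.0), (20.0, 25.0), (33.0, 50.0)]
print(recall_at_1(preds, gts))  # two of the three queries are hits
```

Read this way, a figure such as +2.54 is a difference between two such percentages, that is, a gain in percentage points.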

Why It Matters

Enables more precise AI understanding of human actions from first-person video, improving AR/assistive tech.