Dex2HOI generates bimanual two-object interactions up to 540x faster
A new diffusion model generates full-body motion with both hands handling up to two objects from text, in real time.
Most 4D human-object interaction (HOI) generation research focuses on single-object manipulation, ignoring how humans naturally use both hands to coordinate multiple objects simultaneously. Researchers from Imperial College London and Tencent AI Lab have developed Dex2HOI, a diffusion model that generates full-body motion with both hands interacting with one or two objects from text prompts. At its core is a Dual-Stream Diffusion framework: each object has a dedicated interaction stream, and the streams communicate via bidirectional cross-attention to ensure spatial and temporal coordination. To produce the final motion, a Motion Fusion Network combines the streams using hand-relative object representations and contact-aware conditioning applied across the entire sequence.
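To make the stream coordination concrete, here is a minimal sketch of a bidirectional cross-attention exchange between two per-object streams. It is an illustration under stated assumptions, not Dex2HOI's implementation: the function names are hypothetical, there are no learned projections or multiple heads, and each stream is just a list of per-frame feature vectors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Each query frame attends over all context frames.

    Assumption: keys and values are the raw context features
    (a real model would apply learned projections first).
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score of this query against every context frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Attention-weighted average of the context frames.
        out.append([sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)])
    return out

def bidirectional_exchange(stream_a, stream_b):
    """Residual update of both streams; each reads the other's pre-update features."""
    a_from_b = cross_attend(stream_a, stream_b)
    b_from_a = cross_attend(stream_b, stream_a)
    new_a = [[x + y for x, y in zip(f, u)] for f, u in zip(stream_a, a_from_b)]
    new_b = [[x + y for x, y in zip(f, u)] for f, u in zip(stream_b, b_from_a)]
    return new_a, new_b
```

Both directions are computed from the features as they were before the update, so neither stream gets an ordering advantage over the other.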
Dex2HOI achieves up to 540x faster inference than prior state-of-the-art methods by sampling the diffusion process autoregressively over prefix-conditioned windows, which eliminates costly test-time optimization. It achieves top quantitative results on both single- and two-object benchmarks while running in real time. Because the model can generate arbitrarily long sequences, it is suited to applications in animation, robotics, and VR. Code and models will be released upon acceptance. The work is a significant step toward expressive multi-object manipulation that mirrors real human behavior.
- Uses a Dual-Stream Diffusion approach where each object gets a dedicated stream coordinated via bidirectional cross-attention.
- Achieves up to 540x inference speedup over prior methods by avoiding test-time optimization.
- Generates arbitrarily long sequences in real time from text, handling both single- and two-object interactions.
Why It Matters
Enables more natural, real-time animation and robotic manipulation of multiple objects with both hands.