Dex2HOI generates bimanual two-object interactions at 540x speedup

arXiv cs.CV June 01, 2026

⚡ New diffusion model generates two-handed interactions with up to two objects in real time from text.

Deep Dive

Most 4D human-object interaction (HOI) generation research focuses on single-object manipulation, ignoring how humans naturally use both hands to coordinate multiple objects simultaneously. Researchers from Imperial College London and Tencent AI Lab have developed Dex2HOI, a diffusion model that generates full-body motion with both hands interacting with one or two objects from text prompts. At its core is a Dual-Stream Diffusion framework: each object has a dedicated interaction stream, and the streams communicate via bidirectional cross-attention to ensure spatial and temporal coordination. To produce the final motion, a Motion Fusion Network combines the streams using hand-relative object representations and contact-aware conditioning applied across the entire sequence.
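
To make the coordination step concrete, here is a minimal sketch of bidirectional cross-attention between two per-object streams. It is an illustration under stated assumptions, not the Dex2HOI implementation: the function names, the toy feature sizes, the residual update, and the absence of learned query/key/value projections are all simplifications introduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query frame attends over all context frames.

    Assumption: raw features serve as queries, keys, and values. A trained
    model would apply learned projections first.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended


def bidirectional_cross_attention(stream_a, stream_b):
    """Update both streams from each other's pre-update features (residual add)."""
    a_from_b = cross_attend(stream_a, stream_b)
    b_from_a = cross_attend(stream_b, stream_a)
    new_a = [[x + y for x, y in zip(row, upd)] for row, upd in zip(stream_a, a_from_b)]
    new_b = [[x + y for x, y in zip(row, upd)] for row, upd in zip(stream_b, b_from_a)]
    return new_a, new_b


# Toy data: 3 frames per stream, 2 features per frame (hypothetical values).
stream_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stream_b = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
new_a, new_b = bidirectional_cross_attention(stream_a, stream_b)
```

Both directions read the other stream's features from before the update, so neither object's stream is privileged over the other.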

Dex2HOI achieves up to 540x faster inference than prior state-of-the-art methods by sampling the diffusion process autoregressively over prefix-conditioned windows, which eliminates costly test-time optimization. It achieves top quantitative results on both single- and two-object benchmarks while running in real time. Because each window is conditioned on the frames that precede it, the model can generate arbitrarily long sequences, making it suitable for applications in animation, robotics, and VR. Code and models will be released upon acceptance. The work is a step toward expressive multi-object manipulation that mirrors real human behavior.

Key Points

Uses a Dual-Stream Diffusion approach where each object gets a dedicated stream coordinated via bidirectional cross-attention.
Achieves up to 540x inference speedup over prior methods by avoiding test-time optimization.
Generates arbitrarily long sequences in real time from text, handling both single- and two-object interactions.

Why It Matters

Enables more natural, real-time animation and robotic manipulation of multiple objects with both hands.
