DuoGesture splits gesture generation into semantic and beat streams
DuoGesture's dual-stream AI creates biomechanically plausible, intelligible co-speech gestures.
DuoGesture is a novel dual-stream AI system for generating co-speech gestures, the hand and body movements that accompany speech. Existing holistic models handle lexically grounded semantic gestures and rhythmic beat gestures in a single stream, which limits both semantic grounding and kinematic smoothness. The DuoGesture architecture decomposes synthesis into two coupled streams: a semantic stream responsible for meaning-driven gestures, and a beat stream for prosody-aligned rhythmic motion. A Semantic Variational Information Bottleneck acts as a stochastic frame-level gate, learning when semantic gestures should override rhythmic beats. The semantic stream is further enhanced by Motion-Grounded Semantic Conditioning, which replaces pure word embeddings with motion-language representations to better handle long-tailed lexical triggers.
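To make the frame-level gate concrete, here is a minimal sketch of how a per-frame stochastic gate could choose between two pose streams. It is an illustration only, not the paper's implementation: the function name `blend_streams`, the array shapes, and the Bernoulli sampling from a sigmoid of per-frame logits are all assumptions, and the variational bottleneck objective used to train such a gate is omitted.

```python
import numpy as np

def blend_streams(semantic, beat, gate_logits, rng=None, hard=True):
    """Combine two pose streams with a per-frame gate (illustrative only).

    semantic, beat: arrays of shape (T, D), one pose vector per frame.
    gate_logits:    array of shape (T,), assumed to come from a learned gate.
    hard=True samples a 0/1 gate per frame; hard=False blends by probability.
    """
    semantic = np.asarray(semantic, dtype=float)
    beat = np.asarray(beat, dtype=float)
    # Sigmoid turns each logit into the probability that the semantic
    # stream overrides the beat stream at that frame.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(gate_logits, dtype=float)))
    if hard:
        rng = np.random.default_rng() if rng is None else rng
        gate = (rng.random(probs.shape) < probs).astype(float)
    else:
        gate = probs
    # gate == 1 keeps the semantic pose, gate == 0 keeps the beat pose.
    poses = gate[:, None] * semantic + (1.0 - gate[:, None]) * beat
    return poses, gate
```

With `hard=False` the output is a deterministic soft blend, which is convenient for inspection; with `hard=True` each frame comes entirely from one stream.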
The beat stream is regularized by an Inertial Beat Prior, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective human experiments show DuoGesture outperforms strong holistic baselines. Component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularization. The work is published on arXiv (2605.26236) and spans computer vision and speech processing.
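One way to picture a jitter penalty that applies only to beat frames is sketched below. It is an assumption-laden illustration rather than the paper's Inertial Beat Prior: the function name, the use of squared second differences as a jitter measure, and the `ARM_WEIGHTS` values are all hypothetical placeholders, not real anthropometric data.

```python
import numpy as np

# Hypothetical relative weights for shoulder, elbow, wrist (placeholders,
# not measured anthropometric values).
ARM_WEIGHTS = np.array([3.0, 2.0, 1.0])

def inertial_beat_penalty(joints, semantic_mask, weights=ARM_WEIGHTS):
    """Weighted acceleration penalty over beat frames only (illustrative).

    joints:        array of shape (T, J, 3), 3D positions of J arm joints.
    semantic_mask: boolean array of shape (T,), True where a frame is
                   treated as semantic and therefore left unpenalized.
    """
    joints = np.asarray(joints, dtype=float)
    # Second finite difference approximates per-frame acceleration.
    accel = joints[2:] - 2.0 * joints[1:-1] + joints[:-2]   # (T-2, J, 3)
    per_joint = np.sum(accel ** 2, axis=-1)                  # (T-2, J)
    weighted = per_joint @ np.asarray(weights, dtype=float)  # (T-2,)
    # Align the mask with the interior frames the difference is centered on.
    beat_frames = ~np.asarray(semantic_mask, dtype=bool)[1:-1]
    if not beat_frames.any():
        return 0.0
    return float(np.mean(weighted[beat_frames]))
```

Constant-velocity motion incurs no penalty under this sketch, while frame-to-frame oscillation does, and frames marked as semantic are skipped entirely.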
- Dual-stream architecture separates semantic (meaning) from beat (rhythm) gestures.
- Semantic Variational Information Bottleneck learns, frame by frame, when the semantic stream should override the beat stream.
- Inertial Beat Prior reduces jitter and improves rhythmic smoothness without limiting semantic expressivity.
Why It Matters
Virtual agents can gesture more naturally, and animators can generate intelligible, smooth co-speech gestures automatically.