DAJI framework lets humanoid robots anticipate movements from language commands
New hierarchical model reports 94.42% rollout success on HumanML3D-style generation tasks driven by natural language.
A team of researchers has proposed DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that connects natural language understanding to anticipatory whole-body control for humanoid robots. Its core idea is a joint-intent interface that explicitly encodes upcoming physical transitions, such as contact changes, weight shifts, and balance adjustments, so the robot can prepare for them instead of relying on reactive corrections. DAJI has two components. DAJI-Act distills a future-aware teacher policy into a deployable diffusion action policy through student-driven rollouts. DAJI-Flow is an autoregressive generator that produces future intent chunks from language commands and past intent history.
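The two-level loop described above can be illustrated with a toy sketch. Nothing here comes from the DAJI code: `toy_flow` and `toy_act` are hypothetical stand-ins for DAJI-Flow and DAJI-Act, the three-number intent vector, `CHUNK_LEN`, and `HISTORY_LEN` are assumptions made for illustration, and simple arithmetic rules replace the learned autoregressive and diffusion models. It only shows the data flow: a command plus past intents yields a chunk of future intents, and the low-level policy acts on what is coming next, not only on the current state.

```python
from typing import List

# Hypothetical intent vector: [contact_change, weight_shift, balance_adjust].
Intent = List[float]

CHUNK_LEN = 4    # assumed number of future intent steps per chunk
HISTORY_LEN = 8  # assumed number of past intents fed back to the generator


def toy_flow(command: str, history: List[Intent]) -> List[Intent]:
    """Stand-in for DAJI-Flow: command + past intents -> chunk of future intents."""
    # Toy rule: the command selects a target intent, and each future step
    # moves halfway from the previous intent toward that target.
    target = [1.0, 0.5, 0.2] if "step" in command else [0.0, 0.0, 0.0]
    last = history[-1] if history else [0.0, 0.0, 0.0]
    chunk = []
    for _ in range(CHUNK_LEN):
        last = [l + 0.5 * (t - l) for l, t in zip(last, target)]
        chunk.append(last)
    return chunk


def toy_act(state: float, future_intents: List[Intent]) -> float:
    """Stand-in for DAJI-Act: pick an action from the state and upcoming intents."""
    # Anticipation: average the weight-shift entries still ahead in the chunk,
    # then act to close the gap between that value and the current state.
    anticipated = sum(i[1] for i in future_intents) / len(future_intents)
    return anticipated - state


def run(commands: List[str], chunks_per_command: int = 1) -> List[float]:
    """Stream commands through the generator and the low-level policy."""
    history: List[Intent] = []
    state = 0.0  # toy scalar body state, e.g. a weight-shift coordinate
    states = []
    for command in commands:
        for _ in range(chunks_per_command):
            chunk = toy_flow(command, history[-HISTORY_LEN:])
            for k in range(CHUNK_LEN):
                action = toy_act(state, chunk[k:])  # sees remaining future intents
                state += 0.5 * action               # toy dynamics
                states.append(state)
            history.extend(chunk)  # generated intents become past history
    return states


if __name__ == "__main__":
    for s in run(["step forward", "stand still"]):
        print(round(s, 3))
```

In this toy version the state starts moving toward the commanded weight shift during the first chunk and eases back once the command changes. That only mimics the anticipatory behavior the article describes and says nothing about how the real policies are trained.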
The framework addresses a key limitation of existing language-conditioned humanoid systems. These typically either generate kinematic references that a low-level tracker must correct reactively, or use latent or action policies that do not explicitly model upcoming physical dynamics. By learning to anticipate body state transitions before executing movements, DAJI enables more natural, continuous control from streaming language instructions. The reported results are 94.42% rollout success on HumanML3D-style generation tasks and a subsequence FID of 0.152 on the BABEL benchmark, both surpassing prior approaches. The work, available on arXiv, points toward more intuitive human-robot interaction in which natural language can drive complex, physically aware motions.
- DAJI achieves 94.42% rollout success on HumanML3D-style generation tasks, surpassing prior approaches.
- It uses two components: DAJI-Act (diffusion action policy via distillation) and DAJI-Flow (autoregressive intent generation from language).
- The framework models contact changes, support transfers, and balance preparation explicitly, enabling anticipatory control.
Why It Matters
Enables more natural, fluid human-robot interaction by letting humanoids anticipate physical transitions from natural language commands.