Research & Papers

MotionMERGE unifies fine-grained human motion editing and reasoning in one LLM framework

837K atomic triplets enable localized body part control with chain-of-thought reasoning

Deep Dive

MotionMERGE is a unified framework from researchers at multiple institutions that targets the granularity gap in human motion-language models. Existing systems operate at a coarse level, so they lack the fine-grained control over specific body parts and time spans that animation and interactive applications need. MotionMERGE addresses this by explicitly modeling motion at both part-level and temporal granularity within a single large language model (LLM), giving it the priors needed for precise, localized control.
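To make "localized" concrete, the sketch below represents an edit region as a body part plus a frame range and uses it to restrict a change to that region. It illustrates the granularity idea only and is not MotionMERGE's mechanism: the joint grouping in `BODY_PARTS`, the `LocalEdit` fields, and both helper functions are hypothetical names introduced here.

```python
from dataclasses import dataclass

# Hypothetical joint grouping; real skeletons and part definitions vary by dataset.
BODY_PARTS = {
    "left_arm": [0, 1, 2],
    "right_arm": [3, 4, 5],
    "torso": [6, 7],
}


@dataclass(frozen=True)
class LocalEdit:
    part: str   # key into BODY_PARTS (spatial granularity)
    start: int  # first affected frame, inclusive (temporal granularity)
    end: int    # last affected frame, exclusive


def edit_mask(edit, num_frames, num_joints):
    """Return a frames x joints grid of booleans marking what the edit may change."""
    joints = set(BODY_PARTS[edit.part])
    return [
        [edit.start <= t < edit.end and j in joints for j in range(num_joints)]
        for t in range(num_frames)
    ]


def apply_edit(source, target, mask):
    """Take target values where the mask is True and keep the source elsewhere."""
    return [
        [tgt if m else src for src, tgt, m in zip(src_row, tgt_row, mask_row)]
        for src_row, tgt_row, mask_row in zip(source, target, mask)
    ]
```

With this layout, an instruction such as "raise the right arm during the middle of the clip" would map to one `LocalEdit`, and every joint and frame outside its mask stays identical to the source motion.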

The framework introduces a Reasoning-Aware Granularity-Synergy pre-training strategy that jointly supervises five objectives: cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. To support this, the team curated MotionFineEdit, a large-scale dataset containing 837K atomic triplets and 144K complex triplets, which the authors describe as the first to include fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations. The authors report that MotionMERGE outperforms prior methods in motion generation, understanding, and editing, and that it generalizes zero-shot to other complex motion tasks, which they present as a significant step toward human-like motion interaction.
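One common way to supervise several objectives jointly is a weighted sum of per-objective losses. The sketch below applies that pattern to the five objectives named above. The weighted-sum form, the objective keys, and the default weights are assumptions for illustration; this summary does not say how the authors actually combine the objectives.

```python
# Hypothetical keys for the five objectives; the combination rule is an assumption.
OBJECTIVES = (
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "motion_grounded_cot",
)


def joint_loss(losses, weights=None):
    """Combine per-objective losses into one scalar via a weighted sum."""
    if weights is None:
        # Equal weighting is a placeholder, not a reported setting.
        weights = {name: 1.0 for name in OBJECTIVES}
    missing = [name for name in OBJECTIVES if name not in losses]
    if missing:
        raise KeyError(f"missing losses for: {missing}")
    return sum(weights[name] * losses[name] for name in OBJECTIVES)
```

In a real training loop each entry of `losses` would be a differentiable tensor produced by its own head or prompt format, and the weights would be tuned rather than fixed.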

Key Points
  • MotionMERGE models motion at part-level and temporal granularity within a single LLM, enabling localized editing and detailed understanding of body parts.
  • The Reasoning-Aware Granularity-Synergy pre-training strategy jointly supervises cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought reasoning.
  • The MotionFineEdit dataset (837K atomic + 144K complex triplets) is described as the first to provide fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations.

Why It Matters

MotionMERGE brings precise, reasoning-driven motion control to LLMs, which could open new possibilities for animation, VR, and human-robot interaction.