MotionMERGE unifies fine-grained human motion editing and reasoning in one LLM framework

arXiv cs.CV May 20, 2026

⚡837K atomic triplets enable localized body part control with chain-of-thought reasoning

Deep Dive

MotionMERGE is a unified framework, credited to researchers at multiple institutions, that targets the granularity gap in human motion-language models. Existing systems operate at a coarse level and lack the fine-grained control over specific body parts and time spans that animation and interactive applications need. MotionMERGE addresses this by explicitly modeling motion at both part-level and temporal granularity within a single large language model (LLM), giving it priors for precise, localized control.
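To make "part-level and temporal granularity" concrete, here is a minimal sketch of what a localized edit request could look like as data. The part vocabulary, the `LocalizedEdit` class, and `edit_mask` are hypothetical names for illustration only. The article does not say how MotionMERGE tokenizes or represents motion, so this is not the authors' method.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical part vocabulary; the article does not list the parts MotionMERGE uses.
BODY_PARTS = ("left_arm", "right_arm", "left_leg", "right_leg", "torso", "head")


@dataclass(frozen=True)
class LocalizedEdit:
    """Illustrative edit request: one body part over one span of frames."""
    part: str          # body part the edit targets
    start_frame: int   # first affected frame (inclusive)
    end_frame: int     # end of the affected span (exclusive)
    instruction: str   # free-text corrective instruction

    def __post_init__(self) -> None:
        if self.part not in BODY_PARTS:
            raise ValueError(f"unknown body part: {self.part}")
        if not 0 <= self.start_frame < self.end_frame:
            raise ValueError("frame range must satisfy 0 <= start < end")


def edit_mask(edit: LocalizedEdit, num_frames: int) -> List[List[bool]]:
    """Return a num_frames x len(BODY_PARTS) mask; True marks cells the edit may change."""
    part_idx = BODY_PARTS.index(edit.part)
    end = min(edit.end_frame, num_frames)  # clip the span to the clip length
    return [
        [p == part_idx and edit.start_frame <= f < end for p in range(len(BODY_PARTS))]
        for f in range(num_frames)
    ]


edit = LocalizedEdit("left_arm", 2, 4, "raise the left arm higher")
mask = edit_mask(edit, num_frames=5)
```

Everything outside the mask stays untouched, which is the sense in which such an edit is "localized": a coarse model would in effect treat the whole clip as editable.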

The framework introduces a Reasoning-Aware Granularity-Synergy pre-training strategy that jointly supervises five objectives: cross-granularity alignment, temporal grounding, localized alignment, motion coherency, and motion-grounded chain-of-thought (CoT) reasoning. To support this, the team curated MotionFineEdit, a large-scale dataset containing 837K atomic triplets and 144K complex triplets, described as the first to include fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations. The authors report superior performance in motion generation, understanding, and editing, along with zero-shot generalization to other complex motion tasks.
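One common way to supervise several objectives jointly is a weighted sum of per-objective losses. The sketch below uses the objective names from the article, but the weighted-sum form, the default weights, and the `joint_loss` helper are assumptions. The article does not describe how the objectives are actually combined.

```python
from typing import Dict, Optional

# Objective names come from the article; how they are combined is NOT stated there.
OBJECTIVES = (
    "cross_granularity_alignment",
    "temporal_grounding",
    "localized_alignment",
    "motion_coherency",
    "cot_reasoning",
)


def joint_loss(
    losses: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-objective losses into one scalar via a weighted sum (assumed form)."""
    missing = [name for name in OBJECTIVES if name not in losses]
    if missing:
        raise KeyError(f"missing losses for: {missing}")
    weights = weights or {}
    # Objectives without an explicit weight default to 1.0 (illustrative choice).
    return sum(weights.get(name, 1.0) * losses[name] for name in OBJECTIVES)


per_objective = {name: 0.5 for name in OBJECTIVES}
total = joint_loss(per_objective)
reweighted = joint_loss(per_objective, {"cot_reasoning": 2.0})
```

In a real training loop the values would be differentiable tensors rather than floats, and the weights would be tuned or scheduled.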

Key Points

MotionMERGE models motion at part-level and temporal granularity within a single LLM, enabling localized editing and detailed understanding of body parts.
The Reasoning-Aware Granularity-Synergy pre-training strategy jointly optimizes cross-granularity alignment, temporal grounding, and chain-of-thought reasoning.
The MotionFineEdit dataset (837K atomic + 144K complex triplets) is described as the first to provide fine-grained spatio-temporal corrective instructions and motion-grounded CoT annotations.

Why It Matters

Brings precise, reasoning-driven motion control to LLMs, unlocking new possibilities for animation, VR, and human-robot interaction.
