MoCHA: Denoising Caption Supervision for Motion-Text Retrieval
New framework uses GPT-5.2 to filter out subjective language from motion descriptions, dramatically improving AI's ability to match text to 3D movement.
A team from Georgia Tech and Google has developed MoCHA, a novel framework that addresses a fundamental flaw in training AI for motion-text retrieval. Current systems learn from motion-caption pairs, but each caption is just one subjective description of a movement, mixing objective facts (like 'right arm raises') with stylistic fluff and inferred context. Standard contrastive training treats each unique caption as the single correct label, creating noisy, inconsistent embeddings that weaken the AI's ability to match text to 3D motion accurately.
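To see where the noise enters, here is a minimal sketch of the kind of symmetric contrastive (InfoNCE-style) objective such retrieval systems commonly use; it is generic PyTorch for illustration, not the paper's code. Each raw caption sits on the diagonal of the similarity matrix as the single positive for its motion, so two different wordings of the same movement compete with each other instead of reinforcing one another.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(motion_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of motion-caption pairs.

    Each caption is treated as the one correct match for its motion
    (the diagonal of the similarity matrix), so subjective wording
    differences become conflicting training signals.
    """
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = motion_emb @ text_emb.T / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_m2t = F.cross_entropy(logits, targets)             # motion -> text
    loss_t2m = F.cross_entropy(logits.T, targets)           # text -> motion
    return (loss_m2t + loss_t2m) / 2
```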
MoCHA solves this by acting as a 'denoiser' or canonicalizer. It processes raw captions to extract only the motion-recoverable content—the aspects you could actually see from the 3D joint data—before the text is encoded. This creates tighter, more consistent positive examples during training. The researchers tested two versions: one powered by the large language model GPT-5.2, and a more efficient, distilled FlanT5 model that requires no LLM at inference time.
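As an illustration of what that plug-and-play preprocessing could look like, the sketch below rewrites a raw caption down to its motion-recoverable content before the text encoder ever sees it. The prompt wording and the function names are assumptions made for this example, not the authors' implementation; the `llm` callable stands in for either a hosted LLM or the distilled FlanT5 model mentioned above.

```python
from typing import Callable, List, Tuple

# Hypothetical instruction: keep only what is visually recoverable from
# 3D joint trajectories (body parts, actions, directions, timing) and
# drop style, mood, and inferred context.
CANONICALIZE_PROMPT = (
    "Rewrite this motion caption keeping only what could be observed "
    "from the 3D joint positions alone. Remove subjective style, mood, "
    "and guessed intent.\n\nCaption: {caption}\nRewritten:"
)

def canonicalize_caption(caption: str, llm: Callable[[str], str]) -> str:
    """Denoise one raw caption before it reaches the text encoder."""
    return llm(CANONICALIZE_PROMPT.format(caption=caption)).strip()

def denoise_dataset(pairs: List[Tuple[object, str]], llm: Callable[[str], str]):
    """Plug-and-play step: rewrite captions, leave the motions untouched."""
    return [(motion, canonicalize_caption(text, llm)) for motion, text in pairs]

# Toy usage with a stand-in "LLM" that returns a fixed rewrite:
# cleaned = denoise_dataset(
#     [(motion_tensor, "she joyfully flings her right arm up")],
#     llm=lambda prompt: "right arm raises upward",
# )
```

The cleaned pairs then feed the same contrastive objective as before, which is why the step can be bolted onto an existing pipeline without touching the model architecture.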
The results are significant. When applied to the MotionPatches (MoPa) architecture, MoCHA's LLM variant lifted Text-to-Motion Recall@1 on the HumanML3D benchmark to 13.9%, a 3.1 percentage point improvement, and to 24.3% on KIT-ML, a 10.3 point jump. Perhaps more importantly, by standardizing the language space, MoCHA made the learned representations far more transferable between datasets, with cross-dataset performance improving by 94% in one direction and 52% in the other. Because the framework is designed as a plug-and-play preprocessing step, it can be dropped into existing systems for an immediate accuracy boost.
- MoCHA reduces within-motion text embedding variance by 11-19% by filtering subjective language from captions before training.
- The GPT-5.2-powered variant improved Text-to-Motion Recall@1 by 10.3 percentage points on the KIT-ML benchmark.
- The method improved cross-dataset transfer performance by up to 94%, creating more generalizable motion-language AI models.
Why It Matters
Enables more reliable AI for animation, robotics, and AR/VR by creating robust links between language instructions and physical movement.