KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding
A new benchmark targets one of video AI's most persistent flaws: describing fine-grained human movements without hallucinating details.
A research team led by Boda Lin has introduced KPM-Bench (Kinematic Parsing Motion Benchmark), a breakthrough dataset addressing a critical weakness in current video AI models. Despite advances in systems like GPT-4V and Claude 3, models consistently fail at accurately describing fine-grained human motions in videos, often hallucinating incorrect limb movements or timing details.
The benchmark was created using an automated annotation pipeline that combines kinematic-based motion computation with linguistic parsing, allowing for detailed decomposition of complex actions into precise limb-level descriptions. KPM-Bench contains three core components: fine-grained video-caption pairs that document limb dynamics, diverse question-answer pairs specifically testing motion understanding, and a carefully curated evaluation set designed to measure hallucination in motion descriptions.
Technically, the team developed the Motion Parsing and Extraction (MoPE) algorithm, which can accurately extract motion-specific attributes directly from textual captions without relying on large vision-language models. This enables a new hallucination evaluation metric that functions independently of existing AI systems. When integrated into the GRPO post-training framework, MoPE significantly reduces hallucination problems, improving motion-centric video captioning reliability.
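The paper's MoPE algorithm is not spelled out here, but the core idea, extracting limb-level motion attributes from a caption with lightweight text parsing and scoring hallucination as the fraction of predicted attributes missing from a reference, can be sketched. Everything below (the function names, the limb/verb vocabularies, the attribute schema) is an illustrative assumption, not the authors' implementation:

```python
import re

# Illustrative vocabularies (assumptions, not the paper's actual schema)
LIMBS = {"left arm", "right arm", "left leg", "right leg", "head", "torso"}
VERBS = {"raises", "lowers", "bends", "extends", "rotates", "swings"}

def extract_attributes(caption: str) -> set[tuple[str, str]]:
    """Pull (limb, verb) pairs out of a caption with simple pattern matching,
    with no vision-language model in the loop."""
    text = caption.lower()
    attrs = set()
    for limb in LIMBS:
        for verb in VERBS:
            # match e.g. "raises the left arm" or "left arm slowly raises"
            if re.search(rf"{verb}\s+(the\s+)?{limb}", text) or \
               re.search(rf"{limb}\s+\w*\s*{verb}", text):
                attrs.add((limb, verb))
    return attrs

def hallucination_rate(pred_caption: str, ref_caption: str) -> float:
    """Fraction of predicted motion attributes absent from the reference
    caption; 0.0 means every claimed motion is supported."""
    pred = extract_attributes(pred_caption)
    ref = extract_attributes(ref_caption)
    if not pred:
        return 0.0
    return len(pred - ref) / len(pred)
```

Because the metric operates on attribute sets rather than on raw text similarity, a caption that names the wrong limb or verb is penalized even if it is otherwise fluent, which is exactly the failure mode the benchmark is designed to expose.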
This work matters because current video understanding models struggle with precisely what humans notice most: the subtle details of movement. Applications from sports analysis and physical therapy to security monitoring and autonomous systems require accurate motion understanding that existing models simply can't provide. KPM-Bench provides the first standardized way to measure and improve this capability across the AI research community.
- KPM-Bench dataset includes fine-grained video-caption pairs documenting limb-level dynamics in complex actions
- MoPE algorithm extracts motion attributes from text, enabling hallucination measurement without large models
- MoPE, integrated into the GRPO post-training framework, reduces motion-description hallucinations by supplying a precise evaluation signal
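The last point, using a hallucination score as a reward inside GRPO (Group Relative Policy Optimization), hinges on GRPO's group-relative advantage: each sampled caption's reward is normalized against the other samples in its group rather than against a learned value function. A minimal sketch of that normalization step follows; the choice of reward (here, one minus the hallucination rate) and the group size are assumptions for illustration:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled output's reward
    against its own group, advantage_i = (r_i - mean) / std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    if sigma == 0:
        sigma = 1.0  # all rewards equal: no preference within the group
    return [(r - mu) / sigma for r in rewards]

# Example: rewards could be 1 - hallucination_rate for each of four
# captions sampled from the policy for the same video
rewards = [1.0, 0.5, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Captions with fewer hallucinated motion attributes receive positive advantages and are reinforced; heavily hallucinated ones are pushed down, without requiring any external model as a judge.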
Why It Matters
Enables reliable AI for sports analysis, physical therapy, and security where precise motion understanding is critical.