HY-Himmel cuts video tokens 3.6x with hierarchical motion encoding
New framework beats a dense 32-frame baseline on Video-MME with 3.6x fewer context tokens by encoding motion in the compressed domain
Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost for dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. To address these, researchers from multiple institutions present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to an expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter. This adapter distills motion evidence from three inputs (motion-vector maps, residual maps, and I-frame context) into aligned motion tokens. After contrastive alignment, those tokens are injected into the LLM via a differentiable placeholder mechanism.
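The placeholder-based injection can be pictured with a minimal sketch. It assumes that the prompt reserves one placeholder position per motion token and that the adapter has already produced aligned vectors; `PLACEHOLDER_ID`, `inject_motion_tokens`, and the toy embedding table are hypothetical names for illustration, not the authors' code. Plain Python lists stand in for tensors, so the sketch shows only the splice, not the gradient flow that makes the real mechanism differentiable.

```python
from typing import Dict, List, Sequence

# Hypothetical sentinel id marking a slot reserved for one motion token.
PLACEHOLDER_ID = -1


def inject_motion_tokens(
    token_ids: Sequence[int],
    text_embeddings: Dict[int, List[float]],
    motion_tokens: Sequence[List[float]],
) -> List[List[float]]:
    """Build the LLM input sequence, filling placeholder slots in order.

    Ordinary token ids are looked up in the embedding table; each
    placeholder is replaced by the next motion-token vector.
    """
    sequence: List[List[float]] = []
    motion_iter = iter(motion_tokens)
    for token_id in token_ids:
        if token_id == PLACEHOLDER_ID:
            try:
                sequence.append(list(next(motion_iter)))
            except StopIteration:
                raise ValueError("more placeholders than motion tokens") from None
        else:
            sequence.append(list(text_embeddings[token_id]))
    # Every motion token must have been consumed by a placeholder.
    if next(motion_iter, None) is not None:
        raise ValueError("more motion tokens than placeholders")
    return sequence


# Toy usage: two ordinary tokens interleaved with two motion-token slots.
embeddings = {0: [0.0, 0.0], 1: [1.0, 1.0]}
motion = [[9.0, 9.0], [8.0, 8.0]]
mixed = inject_motion_tokens([0, PLACEHOLDER_ID, 1, PLACEHOLDER_ID], embeddings, motion)
```

In a tensor framework the same splice would typically be written as an indexed assignment into the embedding tensor, which keeps the operation differentiable with respect to the motion tokens.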
On the Video-MME benchmark, HY-Himmel surpasses the dense 32-frame baseline by 2.3 percentage points (61.2% to 63.5%) while using 3.6x fewer context tokens. Ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration are reported to confirm that the full tri-stream is necessary and sufficient for the observed gains. The result suggests that compressed-domain motion encoding can cut token cost while improving accuracy, which would make long-video understanding more practical for applications such as surveillance, content moderation, and video search.
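To put the headline numbers in concrete terms, the arithmetic can be sketched as follows. The per-frame token count is a hypothetical value chosen purely for illustration, since the summary does not state it; only the 32-frame baseline, the 3.6x reduction, and the two accuracy figures come from the reported results.

```python
DENSE_FRAMES = 32        # dense baseline frame count (reported)
TOKEN_REDUCTION = 3.6    # context-token reduction factor (reported)
BASELINE_ACC = 61.2      # dense baseline accuracy on Video-MME, percent (reported)
HIMMEL_ACC = 63.5        # HY-Himmel accuracy on Video-MME, percent (reported)
TOKENS_PER_FRAME = 256   # ASSUMPTION: illustrative value, not from the source


def dense_budget(frames: int, tokens_per_frame: int) -> int:
    """Context tokens used when every frame is encoded densely."""
    return frames * tokens_per_frame


def reduced_budget(dense_tokens: int, reduction: float) -> float:
    """Context tokens implied by a given reduction factor."""
    return dense_tokens / reduction


def accuracy_gain(baseline: float, model: float) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(model - baseline, 1)


dense = dense_budget(DENSE_FRAMES, TOKENS_PER_FRAME)
reduced = reduced_budget(dense, TOKEN_REDUCTION)
gain = accuracy_gain(BASELINE_ACC, HIMMEL_ACC)
print(f"dense: {dense} tokens, reduced: {reduced:.0f} tokens, gain: {gain} points")
```

Under this assumed per-frame count, the dense baseline would occupy 8192 context tokens and the reduced budget would be roughly 2276; the absolute figures change with the assumption, while the 3.6x ratio and the 2.3-point gain do not.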
- HY-Himmel uses sparse anchor I-frames for scene layout and a dense compressed-domain tri-stream adapter for motion encoding, reducing context tokens 3.6x.
- Achieves 63.5% accuracy on Video-MME, outperforming a dense 32-frame baseline by +2.3 percentage points.
- The tri-stream design (motion vectors, residuals, I-frame context) is validated as necessary and sufficient through extensive ablations.
Why It Matters
HY-Himmel shows that long-video understanding can be done with significantly fewer context tokens and less decode work, which could make real-world video AI cheaper to deploy.