UniMVU's instruction-aware gating boosts multimodal video AI by up to 13.5 CIDEr
New framework dynamically weights video, audio, and depth streams based on the text instruction to avoid modality interference, improving scores by up to 13.5 CIDEr points.
Pre-trained video large language models struggle when videos come with auxiliary streams like audio, depth maps, or dense temporal evidence. Uniform fusion causes modality interference, with irrelevant channels distracting the model. To solve this, researchers from multiple institutions present UniMVU, a unified multimodal video understanding framework that performs instruction-aware fusion via two levels of dynamic gating. Inner-modality gates emphasize salient regions within each modality, while modality-level gates re-weight whole streams — both conditioned on the text instruction to adaptively balance modality importance. For time-aligned streams, UniMVU also adopts a fast-to-slow fusion scheme to reduce redundancy.
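The two gating levels can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: the function name `two_level_gate`, the dot-product scoring, the sigmoid token gates, and the softmax over modalities are all assumptions chosen for readability, and the sketch assumes at least one non-empty stream whose token vectors share the instruction's dimension.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(scores: list[float]) -> list[float]:
    # Subtract the max before exponentiating for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def two_level_gate(
    instruction: list[float],
    streams: dict[str, list[list[float]]],
) -> tuple[list[float], dict[str, float]]:
    """Hypothetical instruction-conditioned fusion.

    instruction: embedding of the text instruction.
    streams: modality name -> list of token feature vectors.
    Returns the fused vector and the per-modality weights.
    """
    dim = len(instruction)
    names = list(streams)
    summaries: dict[str, list[float]] = {}
    scores: list[float] = []

    for name in names:
        tokens = streams[name]
        # Inner-modality gate (assumed form): one sigmoid gate per token,
        # scored against the instruction embedding.
        gates = [sigmoid(dot(tok, instruction)) for tok in tokens]
        norm = sum(gates) + 1e-9  # guard against all-zero gates
        # Gate-weighted average of the tokens summarizes the stream.
        summary = [
            sum(g * tok[i] for g, tok in zip(gates, tokens)) / norm
            for i in range(dim)
        ]
        summaries[name] = summary
        # Modality-level relevance score, also conditioned on the instruction.
        scores.append(dot(summary, instruction))

    # Modality-level gate (assumed form): softmax re-weights whole streams.
    weights = softmax(scores)
    fused = [
        sum(w * summaries[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy usage: the video tokens align with the instruction, the audio tokens do not.
instruction = [1.0, 0.0]
streams = {
    "video": [[2.0, 0.0], [0.0, 1.0]],
    "audio": [[-2.0, 0.0], [0.0, 1.0]],
}
fused, weights = two_level_gate(instruction, streams)
print(weights)
```

In this toy setup the video stream receives the larger modality weight because its gated summary points in the same direction as the instruction embedding. A real system would learn the gate parameters rather than use raw dot products.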
Across six challenging benchmarks — AVQA, AVSD, Music-AVQA, ScanQA, SQA3D, and MVBench — UniMVU achieves consistent gains over static-fusion baselines, with improvements as high as 13.5 points on the CIDEr metric. Analysis reveals that the gating mechanism aligns with human-interpretable modality relevance, and ablations confirm the contributions of both inner-modality and modality-level gating. UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.
- Two-level dynamic gating: inner-modality gates focus on salient regions, modality-level gates re-weight entire streams.
- Gating conditioned on text instructions enables adaptive balancing of video, audio, depth, and other modalities.
- Consistent gains over static-fusion baselines across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D, MVBench), with improvements of up to 13.5 CIDEr points.
Why It Matters
Instruction-conditioned fusion lets one model decide how much weight audio, depth, and video streams deserve for each query, without hand-crafted fusion rules.