Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI-Dubbed Content?
A new hierarchical model fuses audio, video, and text to automatically evaluate AI-dubbed content quality.
A team of researchers has developed a novel AI architecture that automatically evaluates the quality of AI-generated dubbed content, a task traditionally reliant on expensive and slow human scoring. The model, presented in a paper accepted at ICASSP 2026, uses a hierarchical approach to fuse complementary signals from audio (speaker identity, prosody), video (facial expressions, scene cues), and text (semantic context). To overcome the scarcity of human-labeled data, the team generated proxy quality scores (proxy MOS) by aggregating objective metrics, with the aggregation weights optimized via active learning, before fine-tuning on actual human ratings.
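The paper's exact metric set and weighting scheme are not spelled out here, but the proxy-label idea can be sketched in a few lines. The metric names, the uncertainty heuristic, and the least-squares refit below are illustrative assumptions standing in for the paper's active-learning optimization, and the "human" ratings are simulated:

```python
import numpy as np

# Hypothetical objective metrics per clip (names are illustrative, not from the paper):
# lip-sync score, ASR-based intelligibility, prosody similarity, semantic similarity.
rng = np.random.default_rng(0)
metrics = rng.uniform(0.0, 1.0, size=(1000, 4))  # 1000 unlabeled clips x 4 metrics

def proxy_mos(metrics: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted aggregation of objective metrics into a proxy MOS on the 1-5 scale."""
    score = metrics @ weights                      # linear combination in [0, 1]
    return 1.0 + 4.0 * np.clip(score, 0.0, 1.0)   # map to MOS range 1-5

# Simplified active-learning loop: repeatedly query human MOS for the clips whose
# current proxy score is least certain, then refit the aggregation weights.
weights = np.full(4, 0.25)          # start from a uniform aggregation
labeled_idx: list[int] = []
for _ in range(5):                  # 5 query rounds
    preds = proxy_mos(metrics, weights)
    uncertainty = -np.abs(preds - 3.0)            # crude proxy: scores near the midpoint
    candidates = np.argsort(uncertainty)[::-1]    # most uncertain first
    new = [i for i in candidates if i not in labeled_idx][:20]
    labeled_idx.extend(new)
    # Stand-in for annotator ratings; in practice these come from human raters.
    human_mos = 1.0 + 4.0 * metrics[labeled_idx].mean(axis=1)
    target = (human_mos - 1.0) / 4.0              # back to [0, 1] for the linear fit
    weights, *_ = np.linalg.lstsq(metrics[labeled_idx], target, rcond=None)

print("learned aggregation weights:", np.round(weights, 3))
```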
Trained on a dataset of 12,000 bidirectional Hindi-English dubbed clips, the architecture employs parameter-efficient LoRA adapters to fine-tune each modality. Features are fused progressively through intra-modal and then inter-modal layers, capturing the multi-dimensional aspects of dubbing quality such as synchronization, intelligibility, and emotional alignment. The resulting model achieves a Pearson Correlation Coefficient (PCC) greater than 0.75 against human ratings, demonstrating strong perceptual alignment. This gives developers and platforms a practical, scalable way to rapidly and consistently assess AI-dubbing output during model training and content production cycles.
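How such a hierarchy might be wired together can be sketched in PyTorch. Everything below is an assumption for illustration: the feature dimensions, layer counts, the simplified LoRA module, and the attention-based inter-modal fusion are not the paper's actual design, just a minimal instance of the pattern it describes (per-modality projection, intra-modal refinement with LoRA, inter-modal fusion, MOS regression head):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (a simplified LoRA)."""
    def __init__(self, dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)        # pretrained weights stay frozen
        self.base.bias.requires_grad_(False)
        self.down = nn.Linear(dim, rank, bias=False)  # A: dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)    # B: rank -> dim
        nn.init.zeros_(self.up.weight)                # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

class HierarchicalFusionMOS(nn.Module):
    """Sketch: modality encoders -> intra-modal fusion -> inter-modal fusion -> MOS head."""
    def __init__(self, d: int = 256):
        super().__init__()
        # Projections from (assumed) pretrained feature extractors into a shared width.
        self.proj = nn.ModuleDict({
            "audio": nn.Linear(768, d),   # e.g., speech/prosody embeddings
            "video": nn.Linear(512, d),   # e.g., face/scene embeddings
            "text":  nn.Linear(768, d),   # e.g., sentence embeddings
        })
        # Intra-modal layers refine each stream; LoRA keeps tuning parameter-efficient.
        self.intra = nn.ModuleDict({m: LoRALinear(d) for m in self.proj})
        # Inter-modal fusion: self-attention over the three modality tokens.
        self.fuse = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, audio, video, text):
        tokens = []
        for name, feats in (("audio", audio), ("video", video), ("text", text)):
            h = torch.relu(self.proj[name](feats))
            tokens.append(self.intra[name](h))
        fused = self.fuse(torch.stack(tokens, dim=1))  # (batch, 3 modalities, d)
        pooled = fused.mean(dim=1)
        return 1.0 + 4.0 * torch.sigmoid(self.head(pooled)).squeeze(-1)  # MOS in [1, 5]

model = HierarchicalFusionMOS()
mos = model(torch.randn(2, 768), torch.randn(2, 512), torch.randn(2, 768))
print(mos.shape)  # torch.Size([2])
```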
- Uses hierarchical fusion of audio, video, and text data to predict human Mean Opinion Scores (MOS) for AI dubs.
- Trained on 12,000 bidirectional Hindi-English clips and achieves strong perceptual alignment (PCC > 0.75; a minimal evaluation check is sketched after this list).
- Employs LoRA adapters for efficient tuning and creates proxy labels via active learning to overcome data scarcity.
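For context on the headline number, this is how a PCC check against held-out human MOS is typically computed; the arrays below are placeholders, not the paper's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholders: in practice, model predictions vs. held-out human MOS ratings.
predicted = np.array([3.8, 2.1, 4.5, 3.0, 1.9, 4.2])
human     = np.array([4.0, 2.5, 4.4, 3.2, 1.7, 4.0])

pcc, p_value = pearsonr(predicted, human)
print(f"PCC = {pcc:.3f} (p = {p_value:.3g})")  # values > 0.75 indicate strong alignment
```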
Why It Matters
Enables scalable, automated quality control for AI-generated voiceovers and dubbing, accelerating content localization.