Multi-View Based Audio Visual Target Speaker Extraction
New AI framework uses multiple camera angles of a speaker's lips to isolate a single voice from noisy audio, retaining the benefit even when only one view is available at inference.
A research team has published a new paper, "Multi-View Based Audio Visual Target Speaker Extraction," introducing a significant advance in isolating a target speaker's voice from a noisy audio mixture. The core innovation is the Multi-View Tensor Fusion (MVTF) framework, which addresses a key limitation of existing methods: their reliance on frontal-view videos alone. By training on synchronized lip videos captured from multiple angles, the system learns the complex, multiplicative interactions between different visual perspectives of speech articulation.
During training, MVTF uses pairwise outer products to explicitly model the correlations between lip embeddings from different views. This learned multi-view knowledge is then distilled into the extraction model, so that, crucially, the system supports both single-view and multi-view inputs at inference. In single-view mode, it leverages the multi-view training to achieve significant gains over traditional single-view models; when multiple camera angles are available, accuracy and robustness improve further. This makes the approach well suited to real-world applications such as video conferencing, hearing aids, and surveillance, where an ideal frontal view is not guaranteed.
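To make the fusion idea concrete, here is a minimal sketch of pairwise outer-product fusion between two lip-embedding views in PyTorch. The class name `PairwiseTensorFusion`, the tensor shapes, and the projection layer are illustrative assumptions, not the authors' actual implementation:

```python
import torch
import torch.nn as nn

class PairwiseTensorFusion(nn.Module):
    """Fuse per-frame lip embeddings from two views via their outer product.

    Hypothetical sketch; names and shapes are assumptions, not the paper's API.
    """

    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        # Project the flattened (dim x dim) interaction tensor back down.
        self.proj = nn.Linear(dim * dim, out_dim)

    def forward(self, view_a: torch.Tensor, view_b: torch.Tensor) -> torch.Tensor:
        # view_a, view_b: (batch, time, dim) lip embeddings from two angles.
        # The per-frame outer product captures multiplicative cross-view terms.
        outer = torch.einsum("btd,bte->btde", view_a, view_b)
        return self.proj(outer.flatten(start_dim=2))  # (batch, time, out_dim)

# Usage: fuse hypothetical frontal and profile embeddings.
frontal = torch.randn(4, 50, 128)   # batch=4, 50 video frames, 128-dim
profile = torch.randn(4, 50, 128)
fusion = PairwiseTensorFusion(dim=128, out_dim=256)
print(fusion(frontal, profile).shape)  # torch.Size([4, 50, 256])
```

With more than two views, the same fusion would be applied to every pair of view embeddings, which is what "pairwise" suggests here.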
- Proposes Multi-View Tensor Fusion (MVTF), a framework that turns multi-view training into single-view performance gains for speaker extraction (a sketch of the distillation idea follows this list).
- Uses pairwise outer products to model multiplicative interactions between lip embeddings from different camera angles during training.
- Demonstrates enhanced robustness and performance in real-world scenarios where non-frontal speaker views are common, with code and demo publicly available.
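The summary does not spell out the distillation objective, but a common pattern is to pull the single-view embedding toward the fused multi-view representation during training. The sketch below assumes an MSE-based distillation term added to the extraction loss; the loss form, the `alpha` weighting, and the function name are hypothetical:

```python
import torch
import torch.nn.functional as F

def total_loss(single_view_emb: torch.Tensor,
               fused_multi_view_emb: torch.Tensor,
               extraction_loss: torch.Tensor,
               alpha: float = 0.5) -> torch.Tensor:
    # Assumed objective: pull the single-view embedding toward the richer
    # fused multi-view representation, so that a single camera angle
    # suffices at inference time. The teacher side is detached so only
    # the single-view path is updated by this term.
    distill = F.mse_loss(single_view_emb, fused_multi_view_emb.detach())
    return extraction_loss + alpha * distill
```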
Why It Matters
This technology could dramatically improve voice isolation in video calls, hearing assistance devices, and security systems, especially in noisy, multi-person environments.