The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge
Researchers combine 360° video, Whisper, and LLMs to achieve perfect speaker clustering in chaotic social settings.
A team of 22 researchers from USTC-NERCSLIP has detailed its winning approach to the CHiME-9 MCoRec Challenge, a benchmark for recognizing and clustering multiple overlapping conversations in noisy indoor social settings. The challenge's core difficulty lies in its extreme realism: up to eight speakers engaged in up to four parallel dialogues, with a speech overlap rate exceeding 90%, far surpassing the complexity of traditional single-topic meeting transcription. To solve this, the team proposed a multimodal cascaded system that pairs single-channel audio with per-speaker visual streams extracted from synchronized 360-degree video, creating a richer signal for disentangling the auditory chaos.
The pipeline cascades three key components, each built on enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). Crucially, the AVSR module incorporates OpenAI's Whisper model and large language model (LLM) techniques to refine transcription accuracy. Their best single system achieved a Speaker Word Error Rate (WER) of 32.44%; applying ROVER, a consensus-based fusion technique, to combine outputs from different system variants reduced the WER to 31.40%. Most impressively, their LLM-based zero-shot conversational clustering achieved a perfect speaker clustering F1 score of 1.0, leading to a final Joint ASR-Clustering Error Rate (JACER) of 15.70%. This represents a significant leap toward machines that can understand human social interaction in its natural, messy, and concurrent form.
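For readers unfamiliar with how WER numbers like 32.44% are derived: WER is the word-level edit distance (substitutions, insertions, deletions) between a hypothesis transcript and the reference, divided by the reference length. A minimal sketch, assuming simple whitespace tokenization; the challenge's official scorer may apply additional text normalization, and the function name here is ours:

```python
# Illustrative word error rate (WER) via Levenshtein edit distance.
# Assumes whitespace tokenization; official scorers often normalize further.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[-1][-1] / len(ref)

# word_error_rate("a b c d", "a x c") -> 0.5 (one substitution, one deletion)
```

ROVER extends this alignment idea to many hypotheses at once, building a word transition network from aligned outputs and taking a majority vote at each slot.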
- System tackles extreme scenario with >90% speech overlap from up to 8 speakers in 4 concurrent conversations.
- Uses a multimodal cascade combining 360° video feeds with audio, enhanced by Whisper and LLMs for transcription refinement.
- Achieved a 31.40% Speaker WER and a perfect 1.0 clustering F1 score using LLM-based zero-shot conversational clustering.
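One common way to score a predicted conversation grouping against the ground-truth partition is pairwise F1 over speaker pairs: a pair counts as positive when both speakers share a conversation. A minimal sketch of this formulation, which is our illustration and not necessarily the challenge's official scorer:

```python
from itertools import combinations

def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1 between two clusterings of the same speakers.

    A speaker pair is positive when both speakers share a cluster.
    Illustrative formulation; the official MCoRec scorer may differ.
    """
    n = len(true_labels)
    tp = fp = fn = 0
    for i, j in combinations(range(n), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1      # pair correctly grouped together
        elif same_pred:
            fp += 1      # grouped together but should be apart
        elif same_true:
            fn += 1      # split apart but should be together
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

Under this metric, an F1 of 1.0 means every pair of speakers was placed in exactly the right conversation, which is what makes the team's zero-shot LLM clustering result striking.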
Why It Matters
This breakthrough enables AI to parse real-world, chaotic group interactions, paving the way for next-gen meeting assistants and social AI.