BUT System Description for CHiME-9 MCoRec Challenge
Combining NVIDIA Parakeet-v2, AV-HuBERT, and Qwen3.5-122B to decode overlapping conversations.
The CHiME-9 MCoRec task targets the difficult problem of transcribing heavily overlapped multiparty conversations using audio and video. The BUT system tackles this with a single-pass long-context target-speaker AV-ASR model. It embeds visual representations from a pre-trained AV-HuBERT model into the encoder of NVIDIA's Parakeet-v2 ASR architecture, allowing the model to focus on a specific speaker even when multiple people speak simultaneously.
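The write-up doesn't include the fusion code; the sketch below is a minimal illustration of the idea, assuming additive fusion of projected AV-HuBERT lip features into the ASR encoder stream. The class name, feature dimensions, and nearest-neighbor upsampling are illustrative assumptions, not the actual BUT implementation.

```python
import torch
import torch.nn as nn

class VisualFusionEncoder(nn.Module):
    """Hypothetical fusion module: projects per-frame AV-HuBERT lip features
    for the target speaker and adds them to the audio encoder stream, so the
    decoder transcribes only that speaker even under overlap."""

    def __init__(self, audio_dim: int = 1024, visual_dim: int = 768):
        super().__init__()
        # Project from the visual embedding space into the audio encoder's
        # hidden space (both dimensions are assumptions).
        self.visual_proj = nn.Linear(visual_dim, audio_dim)

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats:  (batch, T_audio, audio_dim)  from the ASR encoder
        # visual_feats: (batch, T_video, visual_dim) from frozen AV-HuBERT
        v = self.visual_proj(visual_feats)            # (batch, T_video, audio_dim)
        # Upsample video frames (~25 fps) to the audio frame rate before fusion.
        v = nn.functional.interpolate(
            v.transpose(1, 2), size=audio_feats.size(1), mode="nearest"
        ).transpose(1, 2)
        return audio_feats + v                        # additive conditioning
```

Additive fusion is only one option; cross-attention between the audio and visual streams is a common alternative, and which the BUT system uses is not stated in this summary.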
After transcription, the system uses the Qwen3.5-122B large language model to estimate topic similarity between utterances, then applies hierarchical agglomerative clustering to group participants into conversation threads. On the development set, this approach achieved a 33.7% word error rate (WER) and a conversation-clustering F1 score of 0.97, a dramatic improvement over the official baseline (49.9% WER, 0.82 F1). On the evaluation set, BUT ranked second, trailing the winning team by just 0.16% WER and 0.5% F1. The work demonstrates the power of combining state-of-the-art audio-visual ASR with LLM-based semantic reasoning for real-world conversational analysis.
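The exact prompting and clustering configuration aren't given here; a minimal sketch of the grouping step might look like the following, assuming the LLM has already produced a pairwise topic-similarity matrix with scores in [0, 1]. The average-linkage method and the 0.5 distance threshold are hypothetical choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_speakers(similarity: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Group speakers into conversations from a pairwise topic-similarity
    matrix (e.g., scores elicited from an LLM by comparing each pair of
    speakers' transcripts). The threshold value is hypothetical."""
    distance = 1.0 - similarity               # similarity -> distance
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    Z = linkage(condensed, method="average")  # agglomerative, average linkage
    return fcluster(Z, t=threshold, criterion="distance")

# Toy example: speakers 0-1 share one topic, speakers 2-3 another.
sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
print(cluster_speakers(sim))  # -> [1 1 2 2]
```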
- System uses NVIDIA's Parakeet-v2 ASR conditioned on visual features from AV-HuBERT for target-speaker transcription in overlapping speech.
- Conversation grouping is performed by Qwen3.5-122B LLM analyzing transcript topic similarity, followed by hierarchical agglomerative clustering.
- Achieved 33.7% WER and 0.97 F1 on the dev set (absolute gains of 16.2% WER and 0.15 F1 over the baseline); ranked 2nd on the eval set with only a 0.16% WER gap to 1st (see the metric sketch below).
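The challenge's official F1 definition isn't spelled out in this summary; a common choice for evaluating conversation grouping is pairwise F1 over speaker pairs, sketched below under that assumption.

```python
from itertools import combinations

def pairwise_f1(pred: list[int], ref: list[int]) -> float:
    """Pairwise clustering F1: a speaker pair counts as a true positive when
    both the predicted and reference groupings place the two speakers in the
    same conversation. (Pair-based scoring is an assumption here.)"""
    pairs = list(combinations(range(len(pred)), 2))
    tp = sum(1 for i, j in pairs if pred[i] == pred[j] and ref[i] == ref[j])
    fp = sum(1 for i, j in pairs if pred[i] == pred[j] and ref[i] != ref[j])
    fn = sum(1 for i, j in pairs if pred[i] != pred[j] and ref[i] == ref[j])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(pairwise_f1([1, 1, 2, 2], [1, 1, 1, 2]))  # -> 0.4
```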
Why It Matters
This approach shows how combining AV-ASR with LLM reasoning can dramatically improve transcription of complex, overlapping conversations, a capability central to meeting analysis, surveillance, and call-center applications.