TCG CREST System Description for the DISPLACE-M Challenge
TCG CREST's system reaches a 9.21% diarization error rate on noisy rural healthcare recordings, with its Diarizen framework improving on a SpeechBrain baseline by about 39% relative.
Researchers from TCG CREST have detailed their system for the DISPLACE-M challenge, which focuses on speaker diarization in noisy, real-world medical environments. Their report compares two frameworks: a modular pipeline using SpeechBrain with ECAPA-TDNN embeddings, and a hybrid system called Diarizen, which is built on top of the pre-trained WavLM model. The challenge's goal is to identify "who spoke when" in naturalistic medical conversations, a foundational step for producing reliable automated transcripts in rural clinics where background noise is a major obstacle. The team's work specifically evaluates how different voice activity detection methods and clustering algorithms affect overall performance.
The experiments show that the Diarizen system gave a roughly 39% relative improvement in diarization error rate (DER) over the SpeechBrain baseline. The top-performing submission combined the Diarizen framework with Agglomerative Hierarchical Clustering (AHC) and median filtering over a large context window of 29 frames. This configuration achieved a DER of 10.37% on the development set and 9.21% on the final evaluation set. The team placed sixth out of eleven participants. Even so, the results, particularly the effectiveness of the hybrid Diarizen architecture and the exploration of spectral clustering variants such as SC-adapt, represent progress toward robust, automated medical note-taking in challenging acoustic conditions.
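To make the median filtering step concrete, here is a minimal sketch in plain Python. It assumes the filter is applied to a binary per-speaker activity sequence (1 = speaking, 0 = silent) and that the window is shortened at the edges of the recording; both are assumptions for illustration, since the summary only states the 29-frame window. The function name `median_filter` and the toy `activity` sequence are hypothetical.

```python
def median_filter(frames, window=29):
    """Smooth a binary per-frame activity sequence with a sliding median.

    Assumption: the window is truncated at the start and end of the
    sequence rather than padded.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    n = len(frames)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        segment = sorted(frames[lo:hi])
        # Middle element of the sorted window (upper median if even length).
        smoothed.append(segment[len(segment) // 2])
    return smoothed


# Toy example: a 3-frame spurious blip, then speech with a 2-frame dropout.
activity = [0] * 40 + [1] * 3 + [0] * 40 + [1] * 40 + [0] * 2 + [1] * 40
smoothed = median_filter(activity, window=29)

print(sum(smoothed[40:43]))    # frames of the blip that survive smoothing
print(sum(smoothed[123:125]))  # frames of the dropout that were filled in
```

A wide window like this removes short false detections and fills brief dropouts, at the cost of also erasing genuinely short speaker turns that last less than about half the window.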
- The Diarizen system, built on WavLM, delivered a roughly 39% relative improvement in diarization error rate over the SpeechBrain baseline.
- The best configuration (Diarizen with AHC and 29-frame median filtering) achieved a diarization error rate of 9.21% on the evaluation set of noisy medical conversations.
- The research also tested spectral clustering variants such as SC-adapt and SC-PNA as alternatives for clustering speakers in real-world audio.
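For readers unfamiliar with the metrics, the sketch below shows how DER and a relative improvement are conventionally computed. DER is the sum of missed speech, false alarm, and speaker confusion time divided by the total reference speech time. All numbers in the example are made up for illustration and are not taken from the report, which does not list the baseline's DER in this summary; the function names are hypothetical.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER in percent from error durations (all in seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return 100.0 * (missed + false_alarm + confusion) / total_speech


def relative_improvement(baseline_der, new_der):
    """Return the relative DER reduction in percent versus a baseline."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - new_der) / baseline_der


# Hypothetical durations: 30 s missed, 20 s false alarm, 50 s confusion
# out of 1000 s of reference speech.
der = diarization_error_rate(30.0, 20.0, 50.0, 1000.0)

# Hypothetical DERs chosen only to illustrate a 39% relative reduction.
gain = relative_improvement(baseline_der=20.0, new_der=12.2)

print(f"DER: {der:.2f}%")
print(f"Relative improvement: {gain:.1f}%")
```

Note that a relative improvement is measured against the baseline's own error rate, so a 39% relative gain does not mean the DER dropped by 39 percentage points.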
Why It Matters
Accurate diarization is a prerequisite for reliable automated medical transcription in noisy clinics, so gains like these could save time and reduce errors in patient records.