Doctor or Patient? Synergizing Diarization and ASR for Code-Switched Hinglish Medical Conditions Extraction
An open-source AI system topped a field of 25 teams to accurately extract medical data from messy, bilingual doctor-patient conversations.
A consortium of researchers has developed a state-of-the-art AI system that can accurately parse and extract medical information from the notoriously difficult audio of real-world, bilingual clinical conversations. The system, detailed in a new arXiv paper, tackles the DISPLACE-M dataset, which features Hinglish (Hindi-English) dialogues with rapid turn-taking and highly overlapped speech between doctors and patients. To solve this, the team created a two-part architecture: first, an End-to-End Neural Diarization with Vector Clustering (EEND-VC) model precisely identifies who is speaking (doctor or patient) even during dense overlaps. Second, for transcription, they adapted a Qwen3 Automatic Speech Recognition (ASR) model through domain-specific fine-tuning, Devanagari script normalization, and dialogue-level error correction using a Large Language Model (LLM).
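A rough sketch of how such a cascaded pipeline can be wired together is shown below. The helper callables (a diarizer, a per-segment transcriber, a dialogue-level corrector) are hypothetical placeholders, not the authors' code or any real library API; only the Devanagari normalization step uses Python's standard unicodedata module.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable

@dataclass
class Segment:
    """One speaker-attributed stretch of the conversation."""
    start: float   # seconds
    end: float     # seconds
    speaker: str   # e.g. "doctor" or "patient"
    text: str = ""

def normalize_devanagari(text: str) -> str:
    # NFC composition collapses visually identical but differently encoded
    # Devanagari sequences into one canonical form, so ASR output and
    # reference transcripts are compared consistently.
    return unicodedata.normalize("NFC", text)

def cascaded_transcription(
    audio_path: str,
    diarize: Callable[[str], list[Segment]],                      # e.g. an EEND-VC diarizer
    transcribe: Callable[[str, Segment], str],                    # e.g. a fine-tuned ASR model
    correct_dialogue: Callable[[list[Segment]], list[Segment]],   # e.g. an LLM correction pass
) -> list[Segment]:
    # 1) Diarization: decide who speaks when, including overlapped regions.
    segments = diarize(audio_path)

    # 2) ASR per speaker-attributed segment, followed by script normalization.
    for seg in segments:
        seg.text = normalize_devanagari(transcribe(audio_path, seg))

    # 3) Dialogue-level correction: revise the whole transcript in context,
    #    which helps with code-switching and medical terminology.
    return correct_dialogue(segments)
```

In the system described above, the three callables would correspond to the EEND-VC diarizer, the fine-tuned Qwen3 ASR model, and the LLM-based corrector, respectively.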
This cascaded, text-based approach achieved a remarkably low transcription error rate (tcpWER) of 18.59% on the challenging dataset. The researchers benchmarked their open system against proprietary, multimodal end-to-end models that process audio directly for information extraction. While the proprietary models set the performance ceiling, the team's open architecture proved highly competitive, ultimately securing first place out of 25 participants in the official DISPLACE-M challenge. All code and models from this winning entry have been made publicly available, providing a significant open-source resource for the speech processing and medical AI communities. The work demonstrates that well-designed, modular open-source systems can achieve top-tier results on complex, real-world problems where data is messy, code-switched, and acoustically challenging.
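For readers unfamiliar with the metric, tcpWER is a speaker-attributed word error rate that additionally constrains word matches in time. The sketch below computes only plain WER (word-level edit distance), which is the core of the metric; the speaker-permutation and time-constraint machinery is omitted, and the toy Hinglish strings are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: (substitutions + deletions + insertions) / reference length.

    tcpWER builds on this idea but also picks the best mapping between
    reference and hypothesis speakers and only lets words align when their
    timestamps are close.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented toy example: one substitution in a four-word reference -> 25% WER.
print(word_error_rate("mujhe bukhar hai doctor", "mujhe bukhaar hai doctor"))
```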
- The system uses an EEND-VC model for speaker diarization, accurately resolving who is speaking in dense, overlapped conversations.
- A fine-tuned Qwen3 ASR model, enhanced with LLM-based error correction, achieved an 18.59% tcpWER on code-switched Hinglish medical audio.
- It won 1st place in the DISPLACE-M challenge, beating 24 other entries and proving open-source models can rival proprietary tech for this task.
Why It Matters
It enables accurate, automated medical note-taking from real bilingual consultations, improving healthcare efficiency and record-keeping in multilingual regions.