Audio & Speech

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Researchers use Qwen3-Next-80B in a multi-pass system to post-process French clinical speech transcripts, where ASR error rates often exceed 30%.

Deep Dive

A research team from LaTIM and DySoLab has published a method for improving the transcription of French medical conversations, a setting where automatic speech recognition (ASR) systems often produce error rates above 30%. Their paper, 'Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization,' introduces a multi-pass architecture that uses a large language model (LLM) to post-process raw ASR output. The system is evaluated on two challenging, real-world clinical datasets: suicide prevention telephone counseling and preoperative awake neurosurgery consultations. In both settings, accurate transcription and correct speaker attribution (diarization) are critical for patient care and medical documentation.

The technical core of the method is an iterative loop that alternates between two LLM-powered tasks: refining speaker labels and correcting word recognition errors. The team ran ablation studies to tune key design choices, including model selection, prompting strategy, and the order of operations. Using Alibaba's Qwen3-Next-80B model, they achieved statistically significant reductions in Word Diarization Error Rate (WDER) on suicide prevention calls, while performance on neurosurgery consultations remained stable. The system also produced zero output failures and ran at a Real-Time Factor of 0.32 (processing takes roughly a third of the audio's duration), a cost deemed acceptable for offline clinical use. The work offers a practical blueprint for applying LLMs to high-stakes speech applications in specialized, non-English domains where data is scarce and errors are costly.
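The alternating passes described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompts, the `SPEAKER: text` line format, the pass order (speakers first, then words), the fallback to the previous transcript when the model's output cannot be parsed, and the `fake_llm` stand-in are all assumptions made for the example. In practice `llm` would be a call to a hosted model such as Qwen3-Next-80B.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[str, str]  # (speaker label, text)

# Hypothetical instructions; the paper's actual prompts are not reproduced here.
SPEAKER_PROMPT = "Fix the speaker labels. Keep one line per segment."
WORD_PROMPT = "Fix transcription errors. Keep one line per segment."


def render(segments: List[Segment]) -> str:
    """Serialize segments as 'SPEAKER: text' lines."""
    return "\n".join(f"{spk}: {txt}" for spk, txt in segments)


def parse(raw: str, expected_len: int) -> Optional[List[Segment]]:
    """Parse 'SPEAKER: text' lines; return None if the output is malformed."""
    lines = [ln for ln in raw.strip().splitlines() if ln.strip()]
    if len(lines) != expected_len:
        return None
    out: List[Segment] = []
    for ln in lines:
        spk, sep, txt = ln.partition(": ")
        if not sep:
            return None
        out.append((spk.strip(), txt.strip()))
    return out


def run_pass(segments: List[Segment], instruction: str,
             llm: Callable[[str], str]) -> List[Segment]:
    """Run one LLM pass; keep the input unchanged if the reply cannot be parsed."""
    raw = llm(f"{instruction}\n\n{render(segments)}")
    parsed = parse(raw, len(segments))
    return parsed if parsed is not None else segments


def iterative_refine(segments: List[Segment], llm: Callable[[str], str],
                     max_rounds: int = 3) -> List[Segment]:
    """Alternate speaker and word passes until nothing changes or rounds run out."""
    current = list(segments)
    for _ in range(max_rounds):
        before = current
        current = run_pass(current, SPEAKER_PROMPT, llm)  # assumed order
        current = run_pass(current, WORD_PROMPT, llm)
        if current == before:
            break
    return current


def fake_llm(prompt: str) -> str:
    """Toy stand-in for a real model, used only to exercise the loop."""
    body = prompt.split("\n\n", 1)[1]
    if prompt.startswith(SPEAKER_PROMPT):
        return body.replace("PATIENT: Comment", "CLINICIAN: Comment")
    return body.replace("mall", "mal")
```

Stopping when a full round leaves the transcript unchanged keeps the number of model calls bounded, and falling back to the previous transcript on a malformed reply is one simple way to guarantee that every input yields some output.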

Key Points
  • Uses Qwen3-Next-80B in an iterative, multi-pass architecture to post-process ASR output for speaker diarization and word correction.
  • Tested on two sensitive French clinical datasets: suicide prevention calls (n=18) and awake neurosurgery consultations (n=10).
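  • Achieved a statistically significant WDER reduction on suicide prevention calls and stable performance on neurosurgery consultations, with zero output failures and a Real-Time Factor of 0.32, a cost deemed acceptable for offline clinical use.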
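
To make the two reported metrics concrete, here is a small sketch of how they are typically computed. It assumes the commonly used definition of WDER (among reference words that were aligned to a hypothesis word, whether correctly recognized or substituted, the fraction attributed to the wrong speaker) and assumes the word alignment has already been done; the paper's exact variant may differ. The function names are illustrative.

```python
from typing import List, Tuple

# One entry per aligned word pair: (reference speaker, hypothesis speaker).
# Insertions and deletions have no aligned counterpart and are excluded.
AlignedWord = Tuple[str, str]


def wder(aligned: List[AlignedWord]) -> float:
    """Fraction of aligned words whose hypothesis speaker differs from the reference."""
    if not aligned:
        raise ValueError("need at least one aligned word")
    wrong = sum(1 for ref_spk, hyp_spk in aligned if ref_spk != hyp_spk)
    return wrong / len(aligned)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds
```

Under this definition, a Real-Time Factor of 0.32 means a one-hour recording (3,600 seconds) takes about 1,152 seconds, or roughly 19 minutes, to process.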

Why It Matters

Could enable more accurate automated transcription of French clinical conversations, reducing clinician burden and improving documentation for sensitive psychiatric and surgical care.