Audio & Speech

Prompting Whisper for Joint Speech Transcription and Diarization

Researchers fine-tuned Whisper to label who spoke when while transcribing Dutch medical conversations in real time.

Deep Dive

In a new arXiv preprint (2605.05231), researchers Mariia Zamyrova and Henk van den Heuvel investigated a technique that lets OpenAI’s Whisper perform automatic speech recognition (ASR) and speaker diarization jointly. As part of the MediSpeech project, which aims to transcribe Dutch doctor-patient dialogues in real time, the team found that simply prompting Whisper at inference with text containing speaker labels was already enough to make it insert those labels into its own output with reasonable accuracy.
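The prompting idea can be sketched with the open-source openai-whisper package: seed the decoder with a speaker-labeled transcript via the initial_prompt argument, and the model tends to continue the labeling convention in its output. The [SPEAKER_A]/[SPEAKER_B] tag format, the Dutch example text, and the audio file name below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of speaker-label prompting with openai-whisper.
# The tag format and example dialogue are assumptions for illustration.
import whisper

model = whisper.load_model("small")

# A labeled transcript of earlier audio, used to condition the decoder
# so it keeps emitting speaker tags in-context.
labeled_context = (
    "[SPEAKER_A] Goedemorgen, wat kan ik voor u doen? "
    "[SPEAKER_B] Ik heb al een week last van hoofdpijn."
)

result = model.transcribe(
    "consultation_chunk.wav",        # hypothetical audio chunk
    language="nl",
    initial_prompt=labeled_context,  # conditions decoding on labeled text
)
print(result["text"])  # ideally continues the [SPEAKER_X] convention
```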

Building on this, they fine-tuned Whisper on speaker-labeled data using a prompt format inspired by Serialized Output Training (SOT), which produced more consistent speaker IDs across long audio chunks and improved verbatim transcription quality. Significant challenges remain, however: labeling errors in earlier prompts propagate into later chunks, and overlapping speech yields inaccurate timestamps. The work is to be presented at the Joint Workshop on HSCMA and CHiME 2026 and highlights both the promise and the pitfalls of repurposing a single model for dual tasks.
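To make the SOT-inspired target format concrete, here is a hedged sketch of how diarized segments might be serialized into a single training target: segments are ordered by start time and each is prefixed with its speaker label, so one decoder emits both words and speaker identity. The Segment fields and the label syntax are assumptions based on common SOT conventions, not the paper's exact format.

```python
# Illustrative construction of an SOT-style fine-tuning target from
# diarized segments. Field names and label syntax are assumptions.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # e.g. "A" or "B"
    text: str

def serialize_sot(segments: list[Segment]) -> str:
    """Concatenate segments in start-time order, prefixing each with
    its speaker label, so the target string carries both tasks."""
    parts = []
    for seg in sorted(segments, key=lambda s: s.start):
        parts.append(f"[SPEAKER_{seg.speaker}] {seg.text.strip()}")
    return " ".join(parts)

segments = [
    Segment(0.0, 2.1, "A", "Goedemorgen, wat kan ik voor u doen?"),
    Segment(2.3, 5.0, "B", "Ik heb al een week last van hoofdpijn."),
]
print(serialize_sot(segments))
# -> [SPEAKER_A] Goedemorgen, ... [SPEAKER_B] Ik heb al een week ...
```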

Key Points
  • Whisper can be prompted with speaker labels to produce both transcription and diarization, achieving promising accuracy without heavy modifications.
  • Fine-tuning Whisper with SOT-like prompts improved speaker-ID consistency across long audio chunks and verbatim transcription, but error propagation remains a major issue (see the sketch after this list).
  • The model struggles with overlapping speech, generating inaccurate timestamps that degrade diarization performance—a key challenge for real-time medical use.
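The long-chunk behavior and the error-propagation risk can both be seen in a simple decoding loop: each chunk is conditioned on the labeled output of the previous one, which keeps speaker IDs consistent but also lets an early labeling mistake carry forward. This is a sketch under stated assumptions, not the authors' pipeline; the audio is assumed to be pre-split into chunks elsewhere, and the character-based context truncation is a crude stand-in for token-level trimming.

```python
# Hedged sketch of chunked decoding with carried-over labeled context.
# Chunking is assumed to happen upstream; truncation here is simplistic.
import whisper

model = whisper.load_model("small")

def transcribe_long(chunk_paths: list[str]) -> str:
    """chunk_paths: audio chunks in temporal order."""
    context = ""
    pieces = []
    for path in chunk_paths:
        result = model.transcribe(
            path,
            language="nl",
            initial_prompt=context or None,  # carry labeled text forward
        )
        pieces.append(result["text"])
        # Keep only the tail of the running transcript as the next prompt;
        # any wrong speaker tag in it will bias later chunks the same way.
        context = (context + " " + result["text"])[-1000:]
    return " ".join(pieces)
```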

Why It Matters

Enables real-time, speaker-aware transcription in clinical settings, but the accuracy hurdles above must be solved before deployment.