Speaker-Attributed Automatic Speech Recognition Using Speech-Aware LLMs
Researchers adapt a speech-aware LLM to identify and label different speakers in a transcript, outperforming traditional methods.
A team of researchers, including Hagai Aronowitz, Zvi Kons, Avihu Dekel, George Saon, and Ron Hoory, has published a paper detailing a significant upgrade to speaker-attributed automatic speech recognition (SA-ASR). Their work extends Granite-speech, a state-of-the-art speech-aware large language model originally designed for transcription and translation. The core achievement is adapting this single model not only to transcribe speech but also to accurately attribute each segment to the correct speaker, producing transcripts formatted like '[Speaker 1]: Hello [Speaker 2]: Hi'.
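To make the output format concrete, here is a minimal sketch of how a downstream application might split such a speaker-attributed transcript into (speaker, utterance) pairs. The tag syntax is assumed from the single example above; the model's real output format may differ.

```python
import re

def parse_attributed_transcript(text: str) -> list[tuple[str, str]]:
    """Split a transcript like '[Speaker 1]: Hello [Speaker 2]: Hi'
    into (speaker, utterance) pairs. The '[Speaker N]:' tag format is
    an assumption based on the article's example."""
    # Capturing group keeps the speaker label in the split result
    pattern = re.compile(r"\[(Speaker \d+)\]:\s*")
    parts = pattern.split(text)
    # parts alternates: ['', 'Speaker 1', 'Hello ', 'Speaker 2', 'Hi']
    return [(parts[i], parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]
```

For example, `parse_attributed_transcript("[Speaker 1]: Hello [Speaker 2]: Hi")` yields `[("Speaker 1", "Hello"), ("Speaker 2", "Hi")]`.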
The key technical contribution is the introduction of 'speaker cluster identification tags' (e.g., [Speaker 1 cluster 42]). These tags are learned jointly with the transcription task during training, allowing the model to develop a more nuanced understanding of speaker identity. To overcome a lack of diverse, labeled multi-speaker training data, the team developed a data augmentation method using artificially concatenated conversations. This unified, end-to-end approach eliminates the error propagation common in traditional systems, which perform speaker diarization and speech recognition as separate, sequential steps. The result is a model that demonstrates superior performance across established benchmarks, proving the efficacy of using a single, powerful LLM for this complex audio processing task.
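The concatenation-style augmentation described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' code: single-speaker (audio, transcript) clips are stitched into a pseudo-conversation and paired with a speaker-attributed target text. Turn sampling, the absence of overlap or silence padding, and the tag syntax are all assumptions.

```python
import random

def make_synthetic_conversation(clips, num_turns=4, seed=None):
    """Stitch single-speaker clips into one pseudo-conversation.

    `clips` maps speaker_id -> list of (waveform, transcript) pairs,
    where a waveform is a list of samples. Returns the concatenated
    audio and the matching speaker-attributed target string."""
    rng = random.Random(seed)
    speakers = rng.sample(sorted(clips), k=min(2, len(clips)))
    label = {spk: f"Speaker {i + 1}" for i, spk in enumerate(speakers)}
    audio, target = [], []
    for _ in range(num_turns):
        spk = rng.choice(speakers)
        wav, text = rng.choice(clips[spk])
        audio.extend(wav)  # naive concatenation, no overlap modeling
        target.append(f"[{label[spk]}]: {text}")
    return audio, " ".join(target)
```

A real pipeline would operate on audio arrays and likely insert pauses or overlaps; this sketch only shows how concatenation yields both the input audio and the multi-speaker training target at once.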
- Adapts IBM's Granite-speech LLM into a unified model for transcription and speaker identification, moving beyond simple ASR.
- Introduces 'speaker cluster tags' (e.g., [Speaker 1 cluster 42]) trained jointly with the model to improve attribution accuracy.
- Uses synthetic multi-speaker data for augmentation and outperforms conventional diarization+ASR pipelines on multiple benchmarks.
Why It Matters
This enables more accurate, automated transcription of meetings, interviews, and calls, saving time on manual speaker labeling and improving AI meeting assistants.