Audio & Speech

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

LLMs can now transcribe who said what, and when, in group conversations

Deep Dive

A new framework called DM-ASR (Diarization-aware Multi-speaker ASR) tackles the challenge of transcribing conversations with multiple speakers by leveraging large language models (LLMs) alongside explicit speaker diarization. Proposed by Li Li and colleagues, the system reformulates multi-speaker automatic speech recognition as a multi-turn dialogue generation process. Given an audio chunk and pre-computed diarization results, DM-ASR breaks down transcription into a sequence of structured sub-tasks, each corresponding to a specific speaker within a specific time segment. This approach decouples speaker-temporal information from linguistic content, allowing LLMs to focus on modeling language and long-range dependencies while the diarization system provides reliable speaker identities and segment boundaries.
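
The paper's exact prompt format isn't reproduced here, but the decomposition it describes can be sketched as follows: diarization output (speaker labels with time boundaries) is turned into one transcription sub-task per speaker-segment, ordered in time, forming the turns of a dialogue with the LLM. The `DiarSegment` structure and the prompt wording below are illustrative assumptions, not DM-ASR's actual interface.

```python
from dataclasses import dataclass

@dataclass
class DiarSegment:
    speaker: str   # speaker label from the diarization system
    start: float   # segment start time, in seconds
    end: float     # segment end time, in seconds

def build_dialogue_turns(segments):
    """Turn pre-computed diarization output into one structured
    transcription sub-task per (speaker, time segment), ordered by
    start time. Speaker identity and timing come from diarization;
    the LLM only has to produce the words for each sub-task."""
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        prompt = (f"Transcribe speaker {seg.speaker} "
                  f"from {seg.start:.2f}s to {seg.end:.2f}s.")
        turns.append({"role": "user", "content": prompt})
    return turns

segments = [
    DiarSegment("S1", 0.00, 2.40),
    DiarSegment("S2", 2.10, 4.80),  # overlapping speech still gets its own sub-task
]
turns = build_dialogue_turns(segments)
```

The key design point this sketch illustrates is the decoupling: who is speaking and when are fixed by the query, so the model never has to infer speaker-temporal structure itself.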

The framework also introduces an optional word-level timestamp prediction mechanism that interleaves word and timestamp tokens, producing richer structured outputs and improving transcription quality. Experiments on Mandarin and English benchmarks demonstrate that DM-ASR achieves strong performance even with relatively small models and limited training data, remaining competitive with or outperforming existing unified approaches. This work highlights the complementary strengths of diarization systems for structural cues and LLMs for linguistic modeling, offering a practical path forward for multi-speaker ASR without requiring massive datasets or extremely large models.
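
The interleaved word/timestamp output can be pictured with a small sketch. The bracketed token format below is a common convention for timestamp tokens and is assumed here for illustration; DM-ASR's actual token vocabulary may differ.

```python
def interleave_timestamps(words):
    """Interleave word tokens with word-level timestamp tokens,
    producing a single structured stream such as
    '<0.00> hello <0.52> world <0.90>'. Each word is preceded by
    its start-time token; the final word's end time closes the
    sequence. `words` is a list of (word, start, end) tuples."""
    tokens = []
    for word, start, end in words:
        tokens.append(f"<{start:.2f}>")
        tokens.append(word)
    if words:
        tokens.append(f"<{words[-1][2]:.2f}>")  # final end-time token
    return " ".join(tokens)

print(interleave_timestamps([("hello", 0.00, 0.45), ("world", 0.52, 0.90)]))
# prints: <0.00> hello <0.52> world <0.90>
```

Emitting timestamps inline like this gives the decoder a single target sequence to generate, which is how the framework can treat timestamp prediction as an optional extension of ordinary text generation.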

Key Points
  • DM-ASR reformulates multi-speaker ASR as dialogue generation using speaker- and time-conditioned queries
  • Explicitly decouples speaker-temporal structure from linguistic content, combining diarization cues with LLM reasoning
  • Achieves strong performance on Mandarin and English benchmarks with small models and limited data, matching or outperforming unified approaches

Why It Matters

Practical multi-speaker transcription for meetings and calls, enabling accurate speaker-attributed transcripts with smaller models