Audio & Speech

FastMSS simulator sparks synthetic data recipe for multi-talker ASR & diarization

Synthetic-only training approaches real-data baselines, and adding simulated data to real recordings improves both tasks

Deep Dive

A new paper by Polok et al. (Brno University, Hitachi, CMU) tackles the shortage of real multi-talker conversational recordings by studying how synthetic data generation affects state-of-the-art systems. They introduce FastMSS, a highly efficient open-source simulator, and test it on leading architectures: DiCoW for multi-talker ASR and Sortformer for speaker diarization. The study systematically examines turn-taking dynamics, source domain choice, acoustic augmentations, and mixing strategies, revealing that optimal simulation recipes are highly task-dependent.

Key findings include: increasing speech overlap boosts ASR performance but degrades diarization accuracy, while using broad source diversity consistently outperforms matching the exact target domain. Most strikingly, synthetic-only training now approaches real-data baselines, and combining simulated data with real recordings yields substantial gains over real-only training for both tasks. The work, submitted to INTERSPEECH 2026, provides practical guidelines for generating effective synthetic conversational datasets, helping the field close the gap between simulated mixtures and real-world interactions.
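To make the overlap knob concrete, here is a minimal sketch of how a simulator might place utterances on a timeline with a controllable amount of overlap. This is not FastMSS's actual algorithm or API: the function names (simulate_mixture, overlap_fraction), the single overlap_ratio parameter, and the rule that each utterance starts before the previous one ends are all assumptions made for illustration. Audio is represented as plain lists of floats so the example runs with the standard library alone.

```python
def simulate_mixture(utterances, overlap_ratio):
    """Mix utterances into one signal with a controlled overlap (illustrative only).

    utterances: list of (speaker_id, samples) pairs, samples being a list of floats.
    overlap_ratio: value in [0, 1]; each utterance starts this fraction of the
        shorter of (itself, previous utterance) before the previous one ends.
    Returns (mixture, segments), where segments holds (speaker_id, start, end)
    in samples and can serve as diarization labels.
    """
    if not 0.0 <= overlap_ratio <= 1.0:
        raise ValueError("overlap_ratio must be between 0 and 1")

    mixture = []
    segments = []
    prev_end = 0   # sample index where the previous utterance ended
    prev_len = 0   # length of the previous utterance (0 before the first one)

    for speaker, samples in utterances:
        # Number of samples shared with the previous utterance.
        overlap = int(overlap_ratio * min(len(samples), prev_len))
        start = prev_end - overlap
        end = start + len(samples)

        # Grow the mixture with silence, then add the new utterance in place.
        if end > len(mixture):
            mixture.extend([0.0] * (end - len(mixture)))
        for offset, value in enumerate(samples):
            mixture[start + offset] += value

        segments.append((speaker, start, end))
        prev_end, prev_len = end, len(samples)

    return mixture, segments


def overlap_fraction(segments, total_length):
    """Fraction of samples covered by two or more segments.

    Assumes consecutive segments come from different speakers, so any
    doubly covered sample counts as overlapped speech.
    """
    if total_length == 0:
        return 0.0
    active = [0] * total_length
    for _, start, end in segments:
        for i in range(start, end):
            active[i] += 1
    return sum(1 for count in active if count >= 2) / total_length


# Two toy "utterances" of constant amplitude, mixed at 50% overlap.
demo_utterances = [("A", [0.5] * 100), ("B", [0.25] * 100)]
demo_mix, demo_segments = simulate_mixture(demo_utterances, overlap_ratio=0.5)
print(len(demo_mix), demo_segments, overlap_fraction(demo_segments, len(demo_mix)))
```

Sweeping overlap_ratio in a sketch like this is one way to produce training sets with different overlap levels, which is the kind of variable the paper reports as helping ASR while hurting diarization. A real simulator would also handle pauses, more than two speakers, level normalization, and acoustic augmentation, none of which are modeled here.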

Key Points
  • FastMSS is a new open-source simulator for generating multi-talker conversational audio with controlled turn-taking.
  • Increasing speech overlap improves multi-talker ASR (DiCoW) but hurts speaker diarization (Sortformer).
  • Synthetic-only training approaches real-data baselines; synthetic+real outperforms real-only for both tasks.

Why It Matters

The results point toward cheaper, more scalable training of voice assistants and meeting transcription systems, with less reliance on expensive real multi-talker recordings.