New research teaches ASR to recognize laughter, coughs, and cries
Three data-centric strategies overcome sparse annotations of non-verbal sounds
A new paper from an international team of researchers (Gene Yang, Haibin Wu, and others) addresses a major blind spot in modern automatic speech recognition (ASR) systems: non-verbal vocalizations (NVs) such as laughter, breaths, coughs, and cries. These sounds carry crucial conversational and affective information, but ASR systems typically omit them because NV annotations are sparse and follow a long-tail distribution: common events like laughter dominate, while rare ones like crying are poorly recognized. The study, submitted to arXiv in July 2026, introduces three data-centric strategies to improve low-resource NV recognition.
The first is a two-stage curriculum: ASR models initially learn to map all NVs to a single generic token, then fine-tune on specific target categories. The second, inter-token transfer, leverages acoustic structure shared with high-resource events (e.g., laughter, breath) to improve detection of rare events (e.g., crying). The third, voice-conversion augmentation with class balancing, synthetically generates diverse NV examples. Experiments demonstrate that these methods significantly boost rare-category detection without harming lexical transcription accuracy. The work opens the door to ASR systems that understand not just words, but the complete human vocal signal.
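To make the two-stage curriculum concrete, here is a minimal sketch of how training targets might be prepared for each stage. It is an illustration only, not the authors' code: the inline tag format (`<laugh>`, `<cough>`, and so on), the tag inventory in `NV_TAGS`, the generic token `<nv>`, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical tag inventory; the paper's actual token set may differ.
NV_TAGS = ("laugh", "breath", "cough", "cry")
# Hypothetical generic token used for stage-1 targets.
GENERIC_TAG = "<nv>"

# Matches any specific NV tag such as <laugh> or <cry>.
_NV_PATTERN = re.compile(r"<(" + "|".join(NV_TAGS) + r")>")


def stage1_target(transcript: str) -> str:
    """Stage 1: collapse every specific NV tag into the generic token."""
    return _NV_PATTERN.sub(GENERIC_TAG, transcript)


def stage2_target(transcript: str) -> str:
    """Stage 2: keep the specific NV tags unchanged for fine-tuning."""
    return transcript


example = "that is so funny <laugh> sorry <cough> go on"
print(stage1_target(example))
print(stage2_target(example))
```

In this sketch the lexical words are untouched in both stages; only the NV labels change, so stage 1 pools every NV example under one label before stage 2 asks the model to tell the categories apart.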
- Two-stage curriculum maps all NVs to a generic token before fine-tuning on specific categories
- Inter-token transfer uses acoustic similarity between common and rare vocalizations to improve rare event detection
- Voice-conversion augmentation with class balancing generates synthetic NV examples to address data sparsity
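The class-balancing side of the augmentation strategy can be sketched as a simple planning step: count the examples per NV category, then work out how many synthetic examples each category would need to reach a common target. This is a rough illustration under stated assumptions, not the paper's procedure. The function name `augmentation_plan`, the balance-to-the-largest-class rule, and the toy label counts are all hypothetical, and the voice-conversion step itself is not shown.

```python
from collections import Counter


def augmentation_plan(labels, target=None):
    """Return how many synthetic examples each class needs.

    Assumption: every class is topped up to `target`, which defaults
    to the size of the largest class. Classes already at or above the
    target need no synthetic examples.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    return {label: max(0, target - n) for label, n in counts.items()}


# Toy long-tail label counts, invented for illustration only.
labels = ["laugh"] * 6 + ["breath"] * 4 + ["cry"] * 1
plan = augmentation_plan(labels)
print(plan)
```

A voice-conversion system would then be asked to produce the planned number of new examples per category, for instance by re-rendering existing rare-category clips in different speaker voices.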
Why It Matters
ASR systems that recognize non-verbal vocalizations can become more context-aware, capturing emotional and conversational cues beyond words.