Audio & Speech

New research teaches ASR to recognize laughter, coughs, and cries

Three data-centric strategies overcome sparse annotations of non-verbal sounds

Deep Dive

A new paper from an international team of researchers (Gene Yang, Haibin Wu, and others) addresses a major blind spot in modern automatic speech recognition (ASR) systems: non-verbal vocalizations (NVs) such as laughter, breaths, coughs, and cries. These sounds carry crucial conversational and affective information but are typically omitted from ASR output because NV annotations are sparse and follow a long-tail distribution: common events like laughter dominate, while rare ones like crying are poorly recognized. The study, submitted to arXiv in July 2026, introduces three data-centric strategies to improve low-resource NV recognition.

The first is a two-stage curriculum: ASR models initially learn to map all NVs to a single generic token, then are fine-tuned on the specific target categories. The second, inter-token transfer, leverages acoustic structure shared with high-resource events (e.g., laughter, breath) to improve detection of rare events (e.g., crying). The third, voice-conversion augmentation with class balancing, synthetically generates diverse NV examples. The authors report that these methods significantly boost rare-category detection without harming lexical transcription accuracy. The work opens the door to ASR systems that understand not just words, but the complete human vocal signal.
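
To make the two-stage curriculum concrete, here is a minimal sketch of how training targets might be prepared for each stage. The tag names (`<laugh>`, `<cry>`, and so on), the generic token `<nv>`, and the helper functions are all hypothetical: this summary does not give the paper's actual token inventory or data format.

```python
import re

# Hypothetical tag inventory; the paper's real token names are not stated here.
NV_TAGS = ("laugh", "breath", "cough", "cry")
GENERIC_TOKEN = "<nv>"  # assumed name for the single generic NV token

# Matches any specific NV tag such as <laugh> or <cry>.
_TAG_RE = re.compile(r"<(" + "|".join(NV_TAGS) + r")>")


def stage1_target(transcript: str) -> str:
    """Stage 1: collapse every specific NV tag into the generic token."""
    return _TAG_RE.sub(GENERIC_TOKEN, transcript)


def stage2_target(transcript: str) -> str:
    """Stage 2: keep the specific tags so the model learns each category."""
    return transcript


example = "that is so funny <laugh> sorry <cough> go on"
stage1 = stage1_target(example)
stage2 = stage2_target(example)
print(stage1)
print(stage2)
```

In this sketch the audio is identical in both stages; only the target text changes, so the model can first learn where an NV occurs before learning which one it is.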

Key Points
  • Two-stage curriculum maps all NVs to a generic token before fine-tuning on specific categories
  • Inter-token transfer uses acoustic similarity between common and rare vocalizations to improve rare event detection
  • Voice-conversion augmentation with class balancing generates synthetic NV examples to address data sparsity
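
The class-balancing idea in the last point can be illustrated with a small sketch that computes how many synthetic examples each category would need to match the most frequent one. The counts are made up for illustration, the `synthetic_quota` helper is hypothetical, and the voice-conversion step itself is not shown; the paper may balance classes differently.

```python
from typing import Dict


def synthetic_quota(counts: Dict[str, int]) -> Dict[str, int]:
    """Return how many synthetic examples each class needs to reach
    the size of the largest class (a simple balancing rule, assumed here)."""
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Illustrative long-tail counts, not taken from the paper.
real_counts = {"laugh": 600, "breath": 300, "cough": 80, "cry": 20}
quota = synthetic_quota(real_counts)
balanced = {label: real_counts[label] + quota[label] for label in real_counts}
print(quota)
```

Under this rule the rarest category receives the most synthetic examples, which is the intent of pairing augmentation with class balancing.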

Why It Matters

Recognizing NVs makes ASR systems more context-aware, capturing emotional and conversational cues beyond words.
