Audio & Speech

New research teaches ASR to recognize laughter, coughs, and cries

arXiv eess.AS July 03, 2026

⚡ Three data-centric strategies overcome sparse annotations of non-verbal sounds

Deep Dive

A new paper from an international team of researchers (Gene Yang, Haibin Wu, and others) addresses a major blind spot in modern automatic speech recognition (ASR) systems: non-verbal vocalizations (NVs) such as laughter, breaths, coughs, and cries. These sounds carry crucial conversational and affective information but are typically omitted because NV annotations are sparse and follow a long-tail distribution: common events like laughter dominate, while rare ones like crying are poorly recognized. The study, submitted to arXiv in July 2026, introduces three data-centric strategies to improve low-resource NV recognition.

The first is a two-stage curriculum: the ASR model first learns to map all NVs to a single generic token, then is fine-tuned on the specific target categories. The second, inter-token transfer, leverages acoustic structure shared with high-resource events (e.g., laughter, breath) to improve detection of rare events (e.g., crying). The third, voice-conversion augmentation with class balancing, synthetically generates diverse NV examples. The authors report that these methods significantly boost rare-category detection without harming lexical transcription accuracy. The work opens the door to ASR systems that understand not just words but the complete human vocal signal.
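To make the two-stage curriculum concrete, here is a minimal Python sketch of how training targets might be prepared for each stage. It is an illustration only, not the authors' implementation: the tag inventory (`<laugh>`, `<breath>`, `<cough>`, `<cry>`), the generic token name `<nv>`, and the function names are all assumptions, since the summary does not specify them.

```python
import re

# Hypothetical tag inventory; the paper's actual token set is not given here.
NV_TAGS = {"<laugh>", "<breath>", "<cough>", "<cry>"}
GENERIC_NV = "<nv>"  # assumed name for the single generic token

_TAG_RE = re.compile(r"<[a-z_]+>")


def stage1_target(transcript: str) -> str:
    """Stage 1 target: collapse every known NV tag into the generic token."""
    def collapse(match: re.Match) -> str:
        tag = match.group(0)
        # Leave unknown angle-bracket tokens untouched.
        return GENERIC_NV if tag in NV_TAGS else tag

    return _TAG_RE.sub(collapse, transcript)


def stage2_target(transcript: str) -> str:
    """Stage 2 target: keep the category-specific tags as annotated."""
    return transcript


def curriculum_targets(transcript: str) -> dict:
    """Return the target string used in each curriculum stage."""
    return {
        "stage1": stage1_target(transcript),
        "stage2": stage2_target(transcript),
    }
```

Under these assumptions, an annotated utterance such as `well <laugh> that was <cough> odd` would be trained as `well <nv> that was <nv> odd` in the first stage and with its original tags in the second, so the model learns where NVs occur before learning which category each one is.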

Key Points

Two-stage curriculum maps all NVs to a generic token before fine-tuning on specific categories
Inter-token transfer uses acoustic similarity between common and rare vocalizations to improve rare event detection
Voice-conversion augmentation with class balancing generates synthetic NV examples to address data sparsity
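The class-balancing idea in the third point can be sketched as a simple quota calculation: count how many examples each NV category has, then decide how many synthetic examples to generate per category. This is a rough illustration under stated assumptions. The summary does not describe the paper's actual balancing rule, so topping every class up to the size of the largest one is an assumed policy, `augmentation_quota` is a hypothetical helper, and the voice-conversion step itself is not shown.

```python
from collections import Counter


def augmentation_quota(labels, target=None):
    """Return how many synthetic examples each class needs.

    labels: iterable of NV category labels, one per real example.
    target: desired per-class count; defaults to the largest class size
            (an assumed balancing policy, not taken from the paper).
    """
    counts = Counter(labels)
    if not counts:
        return {}
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no synthetic data.
    return {cls: max(0, target - n) for cls, n in counts.items()}
```

For a toy label list with five laughs, three breaths, and one cry, this policy would request no extra laughter, two synthetic breaths, and four synthetic cries, which a voice-conversion system would then be asked to produce.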

Why It Matters

ASR systems become more context-aware, capturing emotional and conversational cues beyond words.
