AffectSpeech: A Large-Scale Emotional Speech Dataset with Fine-Grained Textual Descriptions for Speech Emotion Captioning and Synthesis
A new dataset uses a human-LLM pipeline to annotate speech along six descriptive dimensions, aiming to enable more expressive AI voices.
A research team from institutions including the University of Augsburg and the National University of Singapore has published AffectSpeech, a dataset designed to address a core limitation in AI speech technology. Most current systems rely on a small set of predefined emotion categories, which limits their expressiveness and realism. AffectSpeech tackles this by providing a large corpus of human-recorded speech annotated with rich, structured textual descriptions across six complementary dimensions: sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent emotional segments, and semantic content.
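As a rough illustration of what one annotated clip might look like, the six dimensions can be pictured as fields of a single record. The sketch below is an assumption made for this article, not the dataset's published schema: the class name, field names, value formats, and example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AffectAnnotation:
    """Hypothetical record for one clip; NOT the dataset's actual schema."""
    sentiment_polarity: str              # coarse polarity, e.g. "negative"
    emotion_caption: str                 # open-vocabulary free-text description
    intensity_level: str                 # e.g. "low", "medium", "high"
    prosodic_attributes: Dict[str, str]  # e.g. {"pitch": "high"}
    prominent_segments: List[str]        # words carrying most of the emotion
    semantic_content: str                # what is being said (transcript)

    def to_description(self) -> str:
        """Flatten the six fields into one textual description."""
        prosody = ", ".join(f"{k} {v}" for k, v in self.prosodic_attributes.items())
        segments = ", ".join(f'"{s}"' for s in self.prominent_segments)
        return (
            f"A {self.sentiment_polarity} utterance, {self.emotion_caption}, "
            f"at {self.intensity_level} intensity ({prosody}); "
            f"emphasis on {segments}; "
            f'text: "{self.semantic_content}"'
        )


# Invented example values, for illustration only.
example = AffectAnnotation(
    sentiment_polarity="negative",
    emotion_caption="frustrated and close to giving up",
    intensity_level="high",
    prosodic_attributes={"pitch": "high", "speed": "fast"},
    prominent_segments=["never"],
    semantic_content="This is never going to work.",
)
print(example.to_description())
```

Keeping the dimensions as separate fields, rather than one free-text caption, is what would let a captioning model be scored per dimension and a synthesis model be steered by changing one attribute at a time.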
To create this detailed dataset at scale, the team developed a human-LLM collaborative annotation pipeline. The process combines algorithmic pre-labeling, description generation by multiple LLMs, and rigorous human verification, with the aim of keeping quality high while remaining scalable. The annotations are also reformulated into diverse descriptive styles to reduce linguistic bias in models trained on them. In experiments, the team reports that models trained on AffectSpeech consistently achieved stronger results in both speech emotion captioning (describing the emotion in a clip) and speech emotion synthesis (generating speech with a specified emotional quality), which supports its practical utility for building more nuanced and controllable voice AI.
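The pipeline stages named above (pre-labeling, multi-LLM generation, human verification, style reformulation) can be sketched as a chain of plain functions. This is a minimal sketch under stated assumptions, not the team's implementation: the LLMs and the human reviewer are replaced by ordinary Python callables, and every name and threshold here is hypothetical.

```python
from typing import Callable, Dict, List, Optional

Describer = Callable[[Dict[str, str]], str]      # stand-in for one LLM
Reviewer = Callable[[List[str]], Optional[str]]  # stand-in for a human annotator
Styler = Callable[[str], str]                    # stand-in for a style rewrite


def pre_label(clip_features: Dict[str, float]) -> Dict[str, str]:
    """Algorithmic pre-labeling: turn raw measurements into coarse tags.

    The 200 Hz cut-off is arbitrary and chosen only for illustration.
    """
    pitch = "high" if clip_features["mean_pitch_hz"] > 200 else "low"
    return {"pitch": pitch}


def annotate(
    clip_features: Dict[str, float],
    describers: List[Describer],
    reviewer: Reviewer,
    styles: List[Styler],
) -> List[str]:
    """Run one clip through all four stages and return the styled descriptions."""
    tags = pre_label(clip_features)                         # 1. pre-labeling
    candidates = [describe(tags) for describe in describers]  # 2. multi-LLM generation
    approved = reviewer(candidates)                         # 3. human verification
    if approved is None:
        return []                                           # reviewer rejected all candidates
    return [style(approved) for style in styles]            # 4. style reformulation


# Toy stand-ins so the sketch runs end to end.
describers = [
    lambda t: f"speaks with {t['pitch']} pitch",
    lambda t: f"uses a {t['pitch']}-pitched voice",
]
reviewer = lambda cands: cands[0] if cands else None
styles = [str.capitalize, lambda s: f"The speaker {s}."]

result = annotate({"mean_pitch_hz": 240.0}, describers, reviewer, styles)
print(result)
```

Returning an empty list when the reviewer rejects every candidate keeps unverified text out of the output, which mirrors the role human verification plays in the described process.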
- Dataset annotates speech across six dimensions: sentiment, open-vocabulary captions, intensity, prosody, segments, and semantics.
- Uses a scalable human-LLM collaborative pipeline for annotation, balancing quality and volume.
- Models trained on it are reported to perform better on downstream tasks such as emotion captioning and emotional speech synthesis.
Why It Matters
The dataset offers a resource for building AI voices with more nuanced emotional expression, which matters for entertainment, customer service, and assistive technology.