AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models
A new 1M+ dialogue dataset from 13 TV shows aims to make AI voices sound like specific characters.
A research team has introduced AudioRole, a large-scale, multimodal dataset designed to advance character role-playing in AI voice models. Curated from 13 TV series, the dataset contains over 1 million character-specific dialogues synchronized with their corresponding audio, amounting to more than 1,000 hours of content. It is annotated with speaker identities and contextual metadata, giving models the material to learn not just what specific characters say but how they say it: their vocal tone, style, and personality.
To benchmark progress, the team also created ARP-Eval, a dual-aspect evaluation framework that assesses both the quality of an AI's response and its fidelity to a character's persona. They demonstrated the dataset's value by training an 'ARP-Model' based on GLM-4-Voice. This model achieved an Acoustic Personalization score of 0.31 and a Content Personalization score of 0.36. It outperformed the base GLM-4-Voice model on both measures, by roughly 38% on content personalization, and beat the more advanced MiniCPM-O-2.6 on acoustic personalization while roughly matching it on content personalization.
The release includes the dataset itself, six pre-trained ARP-Models for different characters, and the full evaluation protocol. Together, these give researchers and developers the tools to move audio-grounded conversational AI beyond generic text responses toward immersive, character-driven voice interactions.
- Dataset contains 1M+ dialogues from 115+ characters across 13 TV series, totaling 1,000+ hours of audio-text pairs.
- Introduced ARP-Eval, a new framework to evaluate both response quality and character fidelity in AI voice models.
- A model trained on AudioRole (ARP-Model) outperformed GLM-4-Voice by ~38% in content personalization and beat MiniCPM-O-2.6 in acoustic personalization.
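
For readers who want to sanity-check the "~38%" figure, the short Python sketch below works backward from the reported Content Personalization score of 0.36. It assumes "~38%" means a relative improvement over the base model; the implied baseline it prints is an inference from two rounded numbers, not a score reported in this summary, and the function names are illustrative only.

```python
def implied_baseline(score: float, relative_gain: float) -> float:
    """Return the baseline b that satisfies score = b * (1 + relative_gain)."""
    return score / (1.0 + relative_gain)


def relative_gain(score: float, baseline: float) -> float:
    """Return the relative improvement of score over baseline."""
    return (score - baseline) / baseline


# Figures quoted in the article (both are rounded).
arp_content = 0.36   # Content Personalization score of the ARP-Model
approx_gain = 0.38   # "~38%" improvement over GLM-4-Voice (assumed relative)

# Inferred value, NOT a reported result.
baseline = implied_baseline(arp_content, approx_gain)
print(f"Implied GLM-4-Voice content score: {baseline:.2f}")  # prints 0.26
```

Under that assumption, the base model's content score would sit near 0.26, so the gain corresponds to about a tenth of a point on the metric's scale.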
Why It Matters
This dataset enables more realistic and immersive AI companions, chatbots, and interactive media by giving them distinct, consistent character voices.