Audio & Speech

Computational Narrative Understanding for Expressive Text-to-Speech

Fine-tuning on LibriQuote’s 5.3K hours of expressive audiobook speech improves TTS expressivity by up to 40% in benchmark tests.

Deep Dive

A team of researchers from France (Gaspard Michel, Elena V. Epure, and Christophe Cerisara) has released LibriQuote, a dataset built for training and evaluating expressive text-to-speech (TTS) systems. It contains 5.3K hours of human-narrated audiobook speech, curated to capture the prosodic shifts that occur when narrators move from neutral storytelling to emotive character dialogue. Each quote is annotated with contextual pseudo-labels describing its delivery style (e.g., “whispered softly”), so models can learn to match how a line is spoken as well as what is said.
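To make the annotation scheme concrete, here is a minimal Python sketch of how a quote and its delivery-style pseudo-labels might be represented and turned into a conditioning string. The class name, field names, and helper functions are hypothetical and chosen for illustration only; they are not LibriQuote's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QuoteRecord:
    """Hypothetical record layout (illustrative, not the dataset's real schema)."""
    audio_path: str                 # path to the narrated audio clip
    text: str                       # transcript of the quote
    style_labels: List[str] = field(default_factory=list)  # e.g. ["whispered softly"]


def style_prompt(record: QuoteRecord) -> str:
    """Join the delivery-style pseudo-labels into one conditioning string.

    Falls back to "neutral" when a record carries no style labels
    (an assumption made for this sketch).
    """
    if not record.style_labels:
        return "neutral"
    return ", ".join(record.style_labels)


def split_by_style(
    records: List[QuoteRecord],
) -> Tuple[List[QuoteRecord], List[QuoteRecord]]:
    """Separate records with style labels from those without."""
    expressive = [r for r in records if r.style_labels]
    neutral = [r for r in records if not r.style_labels]
    return expressive, neutral


if __name__ == "__main__":
    quote = QuoteRecord("clip_0001.flac", "Stay close to me.", ["whispered softly"])
    narration = QuoteRecord("clip_0002.flac", "The rain fell all night.")
    for rec in (quote, narration):
        print(rec.text, "->", style_prompt(rec))
```

A layout like this keeps the style description separate from the transcript, so the same text can be paired with different delivery labels during training or evaluation.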

On the LibriQuote-test benchmark, fine-tuning a flow-matching TTS model on this data improved both expressivity and intelligibility. Training an autoregressive TTS model from scratch on LibriQuote also significantly improved its ability to generate expressive speech. The team has open-sourced the dataset, code, and evaluation tools to support further research and reproducibility. Audio samples and project links are available on Hugging Face Spaces and Replicate, so developers can test and iterate.

Key Points
  • LibriQuote: 5.3K hours of expressive audiobook speech with pseudo-labels for delivery style
  • Fine-tuned flow-matching TTS models showed up to 40% gains in expressivity and intelligibility on LibriQuote-test
  • All data, code, and benchmarks are publicly available on Hugging Face and Replicate

Why It Matters

LibriQuote gives TTS models large-scale training data for expressive delivery, which could bring more natural emotional depth to AI voices in podcasts, audiobooks, and virtual assistants.