Collecting Prosody in the Wild: A Content-Controlled, Privacy-First Smartphone Protocol and Empirical Evaluation
A new smartphone method captures speech patterns for AI training while deleting raw audio immediately.
A team of researchers from LMU Munich and the University of St. Gallen has published a novel method for collecting high-quality speech data for AI training while maintaining strict user privacy. The protocol, detailed in a paper submitted to Interspeech 2026, tackles a longstanding challenge: natural prosodic data (the rhythm, stress, and intonation of speech) is often confounded by semantic content and hard to gather at scale due to privacy concerns. Their solution is a smartphone app that prompts users to read scripted sentences aloud, standardizing the lexical content and emotional valence of the prompts so that prosodic variation can be isolated.
The system's key innovation is its privacy-by-design architecture. All audio processing happens directly on the user's device: the raw recording is deleted immediately after prosodic features are extracted, and only those anonymized, numerical features are transmitted for analysis. The researchers empirically validated the protocol in a large-scale study with 560 participants, who contributed a total of 9,877 recordings. Diagnostic tests on the extracted features showed they contained meaningful signal, successfully predicting both speaker sex and self-reported momentary affective states such as valence and arousal. This demonstrates that the method can capture psychologically relevant vocal patterns without ever storing a user's actual voice.
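The on-device flow reduces to a short pipeline: derive a handful of prosodic descriptors, delete the file, and transmit only the numbers. The sketch below is a minimal illustration in Python, assuming a librosa-based extractor; the paper's actual feature set, tooling, and mobile runtime are not specified here, and `extract_and_discard` is a hypothetical name.

```python
# Minimal sketch of the on-device pipeline: extract prosodic features,
# delete the raw audio, keep only derived numbers. Feature choices here
# (pitch, energy, voicing, duration) are illustrative, not the paper's.
import os
import numpy as np
import librosa

def extract_and_discard(wav_path: str) -> dict:
    """Return anonymized prosodic features; the raw audio never leaves the device."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental-frequency (pitch) contour via probabilistic YIN;
    # unvoiced frames come back as NaN, hence the nan-aware statistics.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    # Frame-level RMS energy as a loudness proxy.
    rms = librosa.feature.rms(y=y)[0]

    features = {
        "f0_mean_hz": float(np.nanmean(f0)),
        "f0_std_hz": float(np.nanstd(f0)),
        "voiced_ratio": float(np.mean(voiced_flag)),
        "rms_mean": float(np.mean(rms)),
        "rms_std": float(np.std(rms)),
        "duration_s": float(len(y) / sr),
    }

    # Privacy step: delete the raw recording immediately after extraction.
    os.remove(wav_path)
    return features  # only these numbers are transmitted for analysis
```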
- Protocol uses scripted read-aloud sentences on smartphones to standardize content across 560 participants.
- Performs on-device feature extraction and deletes raw audio, transmitting only derived data for privacy.
- Generated 9,877 recordings whose derived features successfully predicted speaker sex and affective states (valence, arousal); a diagnostic sketch follows this list.
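As a rough illustration of that diagnostic test, the sketch below fits a simple classifier on per-recording feature vectors to predict speaker sex. Everything here is synthetic placeholder data sized to the study, and the paper's actual models and metrics may differ; folds are grouped by speaker so the same voice never appears in both train and test.

```python
# Hypothetical diagnostic: do the anonymized features carry signal for
# speaker sex? X, speaker IDs, and labels are synthetic placeholders
# (560 participants, 9,877 recordings); real features would replace X.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_speakers, n_recordings = 560, 9877

speaker = rng.integers(0, n_speakers, size=n_recordings)  # who made each recording
sex = rng.integers(0, 2, size=n_speakers)                 # one label per speaker
X = rng.normal(size=(n_recordings, 6))                    # per-recording feature vectors
y = sex[speaker]

# Group folds by speaker to avoid leaking a voice across train/test splits.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, groups=speaker,
                         cv=GroupKFold(n_splits=5), scoring="roc_auc")
print(f"speaker-sex AUC (5-fold, grouped by speaker): {scores.mean():.2f}")
```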
Why It Matters
Enables ethical, large-scale collection of vocal emotion data for training more nuanced speech AI models without compromising user privacy.