Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning
A new AI system analyzes only the speaker's face and voice to forecast audience reactions, removing the need to collect any viewer-side data.
A research team led by Hung-Yue Suen has published a study in IEEE Transactions on Computational Social Systems describing a novel 'speaker-centric Emotion AI' system. The core innovation is a dual-model approach that predicts two key audience metrics, affective engagement and vocal attractiveness, using only data from the speaker's side of a video. This addresses a major hurdle in scalable affective computing: the need for intrusive audience-side sensors or feedback. The system was trained and validated on a large corpus of Massive Open Online Course (MOOC) data, making it particularly relevant for educational and professional content creators.
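To make the setup concrete, the sketch below shows what a single training record in such a speaker-centric pipeline could look like: every input field comes from the speaker's side of the video, while the targets are aggregated audience ratings gathered offline. The field names and feature choices are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeakerClipRecord:
    """One video clip, described entirely from the speaker's side (illustrative)."""
    speaker_id: str                    # used later for speaker-independent splits
    facial_dynamics: np.ndarray        # e.g. facial action-unit statistics over the clip
    oculomotor: np.ndarray             # eye-movement features (gaze shifts, blink rate, ...)
    prosody: np.ndarray                # pitch, energy, and speaking-rate descriptors
    semantics: np.ndarray              # embedding of the spoken transcript
    acoustics: np.ndarray              # low-level voice descriptors for the second model
    engagement_label: float            # aggregated audience affective-engagement score
    vocal_attractiveness_label: float  # aggregated audience rating of the voice
```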
The first model predicts affective engagement by fusing multimodal features from the speaker, including facial dynamics, eye movement (oculomotor features), speech prosody, and the cognitive semantics of their words. The second model predicts vocal attractiveness exclusively from acoustic features of the speaker's audio. Crucially, when tested on speaker-independent data, that is, on speakers never seen during training, the models showed high predictive performance, with R² scores of 0.85 for engagement and 0.88 for vocal attractiveness. This empirically confirms that speaker-side affect can functionally represent aggregated audience reactions, enabling a new paradigm for privacy-preserving analytics.
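The sketch below illustrates the dual-model idea together with a speaker-independent evaluation, using a grouped cross-validation split so that no speaker appears in both the training and test folds. The regressor choice (gradient boosting) and the GroupKFold protocol are assumptions for illustration only; the paper's actual models and validation procedure may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold

def speaker_independent_r2(X, y, speaker_ids, n_splits=5):
    """Mean R² across folds where no speaker appears in both train and test sets."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=speaker_ids):
        model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))

# Model 1: multimodal speaker features (facial + oculomotor + prosodic + semantic) -> engagement
# r2_engagement = speaker_independent_r2(X_multimodal, y_engagement, speaker_ids)
# Model 2: acoustic features only -> vocal attractiveness
# r2_voice = speaker_independent_r2(X_acoustic, y_attractiveness, speaker_ids)
```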
This research provides a practical framework for real-time feedback tools. Imagine a teleprompter for a webinar or a recording studio dashboard that gives presenters a live 'engagement score' based on their expressiveness, helping them adjust their delivery on the fly. For platforms hosting instructional videos, this AI could automatically flag content with low predicted engagement for creator review or suggest optimal clips for previews, all without ever analyzing a single viewer's face or reaction.
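As a rough sketch of the real-time feedback idea, the loop below periodically scores the presenter from recently captured speaker-side features and surfaces a nudge when predicted engagement drops. The `get_recent_features` callback, the scoring interval, and the threshold are all hypothetical placeholders rather than details from the paper.

```python
import time

def run_feedback_loop(engagement_model, get_recent_features, interval_s=5.0, threshold=0.4):
    """Score the presenter every few seconds and nudge them when predicted
    engagement drops; no audience-side data is ever read."""
    while True:
        features = get_recent_features()             # speaker-side features for the last window
        score = float(engagement_model.predict([features])[0])
        print(f"predicted engagement: {score:.2f}")
        if score < threshold:                        # threshold is an illustrative cutoff
            print("tip: vary your prosody or facial expressiveness")
        time.sleep(interval_s)
```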
- Dual-model AI predicts audience engagement (R²=0.85) and vocal attractiveness (R²=0.88) from speaker data alone.
- Trained on a large MOOC corpus, the system analyzes the speaker's facial dynamics, eye movements, speech prosody, word semantics, and voice acoustics.
- Enables scalable, privacy-preserving feedback for video creators by eliminating the need for audience-side data collection.
Why It Matters
Enables creators and educators to optimize video content for engagement in real time, without invading viewer privacy.