LLMs beat acoustic models at detecting political emotion in speech
Gemini 2.5 Flash valence ratings correlate strongly with TRUST-Pathos scores (ρ=+0.664); the acoustic model emotion2vec shows no significant correlation.
Juergen Dietrich's new arXiv paper investigates whether acoustic speech emotion recognition (SER) models can effectively measure the Pathos dimension in political speech, a dimension previously operationalized by the TRUST multi-agent LLM pipeline. Using 51 segments (245 seconds) from a Bundestag plenary speech by Felix Banaszak, the study compares three measures: emotion2vec_plus_large (an acoustic SER model with circumplex projection), Gemini 2.5 Flash analyzing both audio and transcript, and the TRUST-Pathos scores from a three-advocate LLM supervisor ensemble.
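The summary does not say how the paper maps emotion2vec's categorical output onto the circumplex. One common approach is a probability-weighted average of per-emotion (valence, arousal) coordinates. The sketch below assumes that approach; the label set and the coordinates in `CIRCUMPLEX` are hypothetical placeholders, not values from the paper.

```python
# Hypothetical (valence, arousal) coordinates in [-1, 1] for a few emotion labels.
# These numbers are illustrative assumptions, not the paper's mapping.
CIRCUMPLEX = {
    "angry": (-0.6, 0.8),
    "happy": (0.8, 0.5),
    "neutral": (0.0, 0.0),
    "sad": (-0.7, -0.5),
}


def project_to_circumplex(probs):
    """Return (valence, arousal) as the probability-weighted mean of label coordinates.

    `probs` maps emotion labels to non-negative scores; they are normalized here,
    so they do not need to sum to 1.
    """
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    valence = sum(p * CIRCUMPLEX[label][0] for label, p in probs.items()) / total
    arousal = sum(p * CIRCUMPLEX[label][1] for label, p in probs.items()) / total
    return valence, arousal


# Usage: one segment's (made-up) class probabilities from an SER model.
segment_probs = {"angry": 0.5, "happy": 0.1, "neutral": 0.3, "sad": 0.1}
valence, arousal = project_to_circumplex(segment_probs)
print(f"valence={valence:.2f}, arousal={arousal:.2f}")
```

A projection like this turns a per-segment label distribution into a continuous valence score, which is what makes a rank correlation against TRUST-Pathos possible.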
Results show Gemini Valence correlates strongly with TRUST-Pathos (Spearman ρ=+0.664, p<0.001), while emotion2vec Valence shows no significant relationship (ρ=+0.097). A further quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini reveals that standard SER benchmarks are compromised by acted speech, cultural bias, and category incompatibility. The findings suggest LLM-based multimodal analysis captures semantically defined political emotion far better than acoustic models alone, though acoustic features remain useful for low-level arousal estimation. Future work will extend to video-based analysis including facial expression and gaze.
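For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation computed on ranks rather than raw values, so it measures whether two scores rise and fall together without assuming a linear relationship. The following is a minimal standard-library sketch; the function names and the sample numbers are made up for illustration and are not the study's data.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Usage with made-up per-segment scores (not the paper's measurements).
model_valence = [0.1, 0.4, 0.3, 0.8, 0.6]
pathos_score = [2, 5, 3, 9, 4]
print(round(spearman_rho(model_valence, pathos_score), 3))
```

This sketch computes only the coefficient. A p-value such as the one reported in the study requires a significance test on top of it, which statistics libraries typically provide.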
- Gemini 2.5 Flash Valence correlates with TRUST-Pathos at ρ=+0.664 (p<0.001), while emotion2vec Valence reaches only ρ=+0.097, which is not significant
- Standard SER benchmarks like EMO-DB suffer from acted speech, cultural bias, and category incompatibility
- Acoustic features remain useful for low-level arousal but not for semantically defined political emotion
Why It Matters
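In this 51-segment study, multimodal LLM analysis tracked emotional persuasion in political speech far more closely than a traditional acoustic model did, which suggests it is the better tool when the emotion of interest is defined semantically rather than acoustically.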