Audio & Speech

LLMs beat acoustic models at detecting political emotion in speech

Gemini 2.5 Flash valence scores correlate strongly with TRUST-Pathos (Spearman ρ=+0.664); the acoustic model emotion2vec shows no significant correlation (ρ=+0.097).

Deep Dive

Juergen Dietrich's new arXiv paper investigates whether acoustic speech emotion recognition (SER) models can measure the Pathos dimension in political speech, which was previously operationalized by the TRUST multi-agent LLM pipeline. Using 51 segments (245 seconds) from a Bundestag plenary speech by Felix Banaszak, the study compares three measures: emotion2vec_plus_large (an acoustic SER model with circumplex projection), Gemini 2.5 Flash analyzing both audio and transcript, and the TRUST-Pathos scores from a three-advocate LLM supervisor ensemble.
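To make "circumplex projection" concrete: one common way to turn categorical SER output into valence and arousal values is a probability-weighted average of per-category coordinates. The sketch below illustrates that general idea only. The label set, the coordinates in `CIRCUMPLEX_COORDS`, and the function name `project_to_circumplex` are hypothetical and are not taken from the paper, which may define its projection differently.

```python
# Hypothetical (valence, arousal) coordinates in [-1, 1] for a few emotion
# categories. These numbers are illustrative assumptions, not the paper's.
CIRCUMPLEX_COORDS = {
    "angry": (-0.6, 0.8),
    "happy": (0.8, 0.5),
    "neutral": (0.0, 0.0),
    "sad": (-0.7, -0.5),
}


def project_to_circumplex(probs):
    """Map category probabilities to a (valence, arousal) pair.

    The result is the probability-weighted mean of the category
    coordinates. Inputs are renormalized, so they need not sum to 1.
    """
    unknown = set(probs) - set(CIRCUMPLEX_COORDS)
    if unknown:
        raise ValueError(f"no coordinates for labels: {sorted(unknown)}")
    total = sum(probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    valence = sum(p * CIRCUMPLEX_COORDS[k][0] for k, p in probs.items()) / total
    arousal = sum(p * CIRCUMPLEX_COORDS[k][1] for k, p in probs.items()) / total
    return valence, arousal


# Made-up classifier output for one speech segment.
segment = {"angry": 0.6, "neutral": 0.3, "sad": 0.1}
valence, arousal = project_to_circumplex(segment)
print(f"valence={valence:.2f}, arousal={arousal:.2f}")
```

A projection of this kind depends only on the classifier's category probabilities, so it cannot reflect what the speaker is actually saying.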

Results show that Gemini Valence correlates strongly with TRUST-Pathos (Spearman ρ=+0.664, p<0.001), while emotion2vec Valence shows no significant relationship (ρ=+0.097). A further quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini indicates that standard SER benchmarks are compromised by acted speech, cultural bias, and category incompatibility. The findings suggest that LLM-based multimodal analysis captures semantically defined political emotion far better than acoustic models alone, though acoustic features remain useful for low-level arousal estimation. Future work will extend the approach to video-based analysis, including facial expression and gaze.
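For readers unfamiliar with the statistic reported above, Spearman's ρ is the Pearson correlation computed on ranks rather than raw values, so it measures whether two score series rise and fall together without assuming a linear relationship. The following is a minimal standard-library sketch. The helper names and the five example scores are made up for illustration and are not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-segment scores (NOT from the paper).
pathos_scores = [0.2, 0.5, 0.9, 0.4, 0.7]
valence_scores = [0.1, 0.6, 0.8, 0.3, 0.9]
print(round(spearman_rho(pathos_scores, valence_scores), 3))  # 0.9
```

The sketch omits the significance test. In practice a statistics library would be used to obtain both ρ and the p-value.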

Key Points
  • Gemini 2.5 Flash valence correlates with TRUST-Pathos at ρ=+0.664 (p<0.001), while the acoustic model emotion2vec reaches only ρ=+0.097 (not significant)
  • Standard SER benchmarks like EMO-DB suffer from acted speech, cultural bias, and category incompatibility
  • Acoustic features remain useful for low-level arousal but not for semantically defined political emotion

Why It Matters

For gauging emotional persuasion in political speech, LLM-based multimodal analysis tracked the reference Pathos scores far more closely than a traditional acoustic model did, at least in this single-speech sample of 51 segments.