DraDDP: First multimodal dataset for multi-party dialogue parsing
495 TV drama segments, 6,374 utterances, and 9.1 hours of video…
Multi-party dialogue discourse parsing identifies which utterances depend on which earlier ones and what type of relation links them. Work on this task has long been limited to text-only or two-party conversations. To bridge that gap, a team from Soochow University introduces DraDDP (Multimodal Multi-Party Dialogue Discourse Parsing Dataset), the first public English multimodal dataset for the task, built from American TV dramas. DraDDP contains 495 dialogue segments totaling 6,374 utterances and 9.1 hours of synchronized video, covering multi-party interactions such as group arguments, along with the turn-taking and non-verbal cues that accompany them.
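To make the task concrete, here is a minimal Python sketch of what a parsed dialogue looks like and how a predicted parse might be scored. Everything in it is illustrative: the four-utterance dialogue, the speaker names, and the relation labels are invented for this example and are not taken from DraDDP, and the two scores (link F1, and link plus relation F1) are a common way to evaluate discourse parsers rather than a confirmed description of the paper's evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    head: int        # index of the earlier utterance being responded to
    dependent: int   # index of the utterance that attaches to it
    relation: str    # illustrative label, NOT DraDDP's actual label set


def f1(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def link_f1(predicted: list, gold: list) -> float:
    """Score dependency structure only, ignoring relation types."""
    pred_links = {(e.head, e.dependent) for e in predicted}
    gold_links = {(e.head, e.dependent) for e in gold}
    return f1(pred_links, gold_links)


def link_rel_f1(predicted: list, gold: list) -> float:
    """Score structure and relation type together."""
    return f1(set(predicted), set(gold))


# Hypothetical three-speaker exchange (invented for illustration).
utterances = [
    ("A", "Did anyone see my keys?"),
    ("B", "They're on the counter."),
    ("C", "I moved them there this morning."),
    ("A", "Oh, thanks."),
]

# Hypothetical gold annotation and a hypothetical parser output.
gold = [
    Edge(0, 1, "Question-Answer"),
    Edge(1, 2, "Elaboration"),
    Edge(1, 3, "Acknowledgement"),
]
pred = [
    Edge(0, 1, "Question-Answer"),
    Edge(1, 2, "Explanation"),      # right link, wrong relation
    Edge(0, 3, "Acknowledgement"),  # wrong link
]

print(round(link_f1(pred, gold), 2))
print(round(link_rel_f1(pred, gold), 2))
```

In this toy case the parser recovers two of the three links but only one fully labeled edge, so the link score is higher than the link plus relation score. Cues such as gaze or gesture are the kind of signal that could, in principle, help decide whom speaker A is thanking in the last utterance.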
Benchmark experiments on DraDDP show that adding multimodal information (facial expressions, gestures, and scene context) significantly improves accuracy on both dependency parsing and relation type classification compared with text-only baselines. The dataset, its annotation guidelines, and the source code are all publicly released to accelerate research in multimodal dialogue understanding. The work opens the door to more natural AI agents that can parse complex real-world conversations, where body language and tone matter as much as words.
- DraDDP is the first public English multimodal dataset for multi-party dialogue discourse parsing, sourced from American TV dramas.
- Includes 495 segments, 6,374 utterances, and 9.1 hours of parallel video content.
- Multimodal input (video+text) outperforms text-only models in both structure detection and relation classification.
Why It Matters
DraDDP gives AI systems the data to interpret real-world group conversations using both words and nonverbal cues, which could advance dialogue systems and social robots.