CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

New triple-annotation dataset reveals AI's struggle with Chinese metaphor and sarcasm.

Deep Dive

A research team led by Junzhao Zhang has introduced CFMS, presented as the first fine-grained multimodal sarcasm detection benchmark built specifically for Chinese social media. The dataset comprises 2,796 high-quality image-text pairs annotated under a novel triple-level framework: a model must identify whether a post is sarcastic, recognize the target of the sarcasm, and generate an explanation, rather than produce only a coarse yes/no label. The team also curated a parallel Chinese-English metaphor subset with 200 entries per language. This subset revealed significant limitations in current models' metaphoric reasoning, pointing to a key cultural and linguistic challenge for AI.
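
As a rough illustration of what a triple-level annotation could look like in code, the sketch below models one record with the three levels described above. The class name, field names, and validation rule are assumptions made for this example; the actual CFMS release format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SarcasmRecord:
    """One image-text pair with three annotation levels (hypothetical schema)."""

    image_path: str                    # path to the image (assumed field)
    text: str                          # accompanying post text (assumed field)
    is_sarcastic: bool                 # level 1: sarcasm identification
    target: Optional[str] = None       # level 2: what the sarcasm is aimed at
    explanation: Optional[str] = None  # level 3: free-text rationale

    def __post_init__(self) -> None:
        # Assumed rule: a sarcastic record must carry a target and an explanation.
        if self.is_sarcastic and not (self.target and self.explanation):
            raise ValueError("sarcastic records need a target and an explanation")


def annotated_levels(record: SarcasmRecord) -> int:
    """Count how many of the three levels are filled in for this record."""
    return 1 + int(bool(record.target)) + int(bool(record.explanation))
```

Under this assumed schema, a non-sarcastic record fills in only the first level, so annotated_levels returns 1 for it, while a complete sarcastic record returns 3.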

To address the limitations of traditional retrieval methods for selecting in-context examples, the researchers proposed PGDS, a reinforcement-learning-augmented in-context learning strategy. Instead of fixing the examples in advance, PGDS dynamically optimizes which exemplars are placed in the prompt for a given task. In the reported experiments, PGDS significantly outperforms existing baselines on key sarcasm understanding tasks. Together, the CFMS benchmark and PGDS provide an explainable foundation for building more reliable, culturally aware multimodal AI systems that can parse complex, context-dependent communication such as sarcasm.
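
The internals of PGDS are not given in this summary, so the following is not the authors' algorithm. It is a minimal sketch of the general idea of learning exemplar selection with reinforcement learning, using a plain REINFORCE update over a softmax policy. The function names, the single-exemplar setup, and the toy reward (a stand-in for scoring a model's answer on a held-out query) are all assumptions.

```python
import math
import random
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_selector(
    num_candidates: int,
    reward_fn: Callable[[int], float],
    steps: int = 2000,
    lr: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Learn a selection distribution over candidate exemplars with REINFORCE.

    reward_fn(i) is a hypothetical hook: in practice it would build a prompt
    with exemplar i, query the model, and score the answer.
    """
    rng = random.Random(seed)
    logits = [0.0] * num_candidates
    baseline = 0.0  # running average reward, used to reduce variance

    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(num_candidates), weights=probs, k=1)[0]
        reward = reward_fn(choice)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)

        # Gradient of log pi(choice) w.r.t. logit j is 1[j == choice] - probs[j].
        for j in range(num_candidates):
            grad = (1.0 if j == choice else 0.0) - probs[j]
            logits[j] += lr * advantage * grad

    return softmax(logits)


def toy_reward(index: int) -> float:
    """Assumed fixed usefulness of three candidate exemplars (illustrative only)."""
    return [0.2, 0.9, 0.4][index]


if __name__ == "__main__":
    print(train_selector(3, toy_reward))
```

With the toy reward, the learned distribution concentrates on the second candidate, the one with the highest assumed usefulness. A realistic selector would likely condition on the query and choose several exemplars at once, which this sketch leaves out.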

Key Points
  • CFMS is the first fine-grained dataset for Chinese multimodal sarcasm, with 2,796 annotated image-text pairs.
  • It uses a triple-level annotation framework (sarcasm identification, target recognition, explanation generation), and its metaphor subset exposes current models' weakness with Chinese metaphor.
  • The new PGDS method uses RL to optimize in-context learning, outperforming previous baselines.

Why It Matters

CFMS enables more culturally aware AI for content moderation, social analysis, and human-computer interaction in Chinese digital spaces.