ProCrit: A two-agent framework for multimodal sarcasm detection
New AI system generates its own reasoning perspectives, then critiques and revises them for better accuracy.
Multimodal sarcasm detection, which requires reasoning over cross-modal incongruities between literal expression and intended meaning, has long struggled with the diversity of sarcastic mechanisms. Existing methods rely on fixed, predefined analytical perspectives that operate independently under hand-crafted routing rules. Researchers from multiple institutions tackle this limitation with ProCrit, a Proposal-Critic two-agent framework. The proposal agent autonomously generates the specific reasoning perspectives needed for each sample, progressively integrating them into a coherent analysis. A separate critic agent independently evaluates the reasoning, identifies deficiencies, and provides natural-language feedback for directed revision.
The system addresses the lack of process-level supervision in existing sarcasm datasets by synthesizing reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, preserving cross-perspective dependencies. ProCrit adopts a draft-critique-revise paradigm and uses a mutual-refinement training framework that jointly optimizes both agents via dual-stage reinforcement learning, with the critic agent refined according to how effective its feedback actually proves to be. In experiments on three widely used benchmarks, ProCrit outperforms methods that rely on fixed perspectives and predefined routing rules. The results point toward more adaptive and robust multimodal reasoning systems.
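The draft-critique-revise loop can be sketched in a few lines of Python. This is a minimal illustration, not ProCrit's implementation: the names `draft_critique_revise`, `Critique`, `toy_propose`, and `toy_critique` are hypothetical, the two agents are stand-in functions rather than vision-language models, and the stopping rule (`max_rounds` plus an `acceptable` flag) is an assumption the summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Critique:
    """Critic output: a verdict plus natural-language feedback (hypothetical shape)."""
    acceptable: bool
    feedback: str


Sample = dict
ProposeFn = Callable[[Sample, Optional[str], Optional[str]], str]
CritiqueFn = Callable[[Sample, str], Critique]


def draft_critique_revise(sample: Sample, propose: ProposeFn,
                          critique: CritiqueFn, max_rounds: int = 2) -> str:
    """Draft an analysis, then revise it while the critic still finds deficiencies."""
    analysis = propose(sample, None, None)  # initial draft, no feedback yet
    for _ in range(max_rounds):
        review = critique(sample, analysis)
        if review.acceptable:
            break
        # Directed revision: the proposer sees its own draft and the critic's feedback.
        analysis = propose(sample, analysis, review.feedback)
    return analysis


# Toy stand-ins for the two agents (assumptions for illustration only).
def toy_propose(sample: Sample, previous: Optional[str], feedback: Optional[str]) -> str:
    if feedback is None:
        return f"Draft: the caption '{sample['caption']}' sounds positive."
    return (f"Revised: the caption '{sample['caption']}' sounds positive, "
            f"but the image shows a {sample['image']}, so the praise is likely ironic.")


def toy_critique(sample: Sample, analysis: str) -> Critique:
    # Accept only analyses that mention the image content.
    if sample["image"] in analysis:
        return Critique(True, "")
    return Critique(False, "The analysis ignores the image; compare it with the caption.")


sample = {"caption": "Great weather for a picnic", "image": "flooded park"}
result = draft_critique_revise(sample, toy_propose, toy_critique)
print(result)
```

In this toy run the critic rejects the first draft because it never mentions the image, and the revised analysis picks up the caption-image incongruity. Keeping the critic a separate function mirrors the article's point that evaluation is independent of the agent that wrote the reasoning.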
- ProCrit autonomously generates sample-specific reasoning perspectives instead of relying on fixed, predefined ones.
- A draft-critique-revise paradigm uses an independent critic to provide targeted natural-language feedback for revision.
- Dual-stage reinforcement learning jointly optimizes both the proposal and critic agents, with the critic refined based on feedback effectiveness.
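One way to read "refined based on feedback effectiveness" is as a reward that compares the proposal agent's prediction before and after revision. The sketch below is an assumption for illustration only: the summary does not give ProCrit's actual reward, and `critic_reward` and its arguments are hypothetical names.

```python
def critic_reward(label: int, draft_pred: int, revised_pred: int) -> float:
    """Hypothetical critic reward: did the feedback help the revision?

    +1.0 if the revision fixed a wrong draft,
    -1.0 if the revision broke a correct draft,
     0.0 if correctness did not change.
    """
    draft_ok = float(draft_pred == label)
    revised_ok = float(revised_pred == label)
    return revised_ok - draft_ok


# Labels: 1 = sarcastic, 0 = not sarcastic (assumed encoding).
helpful = critic_reward(label=1, draft_pred=0, revised_pred=1)
harmful = critic_reward(label=1, draft_pred=1, revised_pred=0)
neutral = critic_reward(label=0, draft_pred=0, revised_pred=0)
```

Under this reading the critic is scored on the outcome of the revision rather than on how plausible its feedback sounds, which matches the article's description of refining the critic on "actual effectiveness".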
Why It Matters
ProCrit advances multimodal AI reasoning by letting models adaptively generate and refine their own analytical perspectives for nuanced tasks such as sarcasm detection.