MultiSoc-4D benchmark reveals LLMs miss 79% of hate speech in Bengali social media
ChatGPT, Gemini, Claude, and Grok miss most sarcasm, and their agreement on it is no better than chance.
Researchers from Bangladesh University of Engineering and Technology (BUET) and collaborators have introduced MultiSoc-4D, a diagnostic benchmark designed to expose a critical flaw in LLM-based annotation for low-resource languages. The dataset comprises over 58,000 Bengali social media comments sourced from six platforms, annotated across four dimensions: category, sentiment, hate speech, and sarcasm. Using a structured pipeline, the team partitioned the data among ChatGPT, Gemini, Claude, and Grok, with a shared 20% validation set used to compare model outputs against human-calibrated references. The results revealed a systematic phenomenon the researchers call 'instruction-induced label collapse,' in which LLMs consistently default to fallback labels such as 'Other,' 'Neutral,' or 'No' rather than commit to minority categories.
The impact is stark: relative to human-validated labels, the models failed to detect 79% of hateful instances and 75% of sarcastic content. Even more troubling, this collapse creates a 'label agreement illusion': the models appear to agree with each other on sarcasm detection because they mostly return the same fallback label, yet Fleiss' Kappa, which corrects for agreement expected by chance, came out at approximately -0.001, indicating no agreement beyond chance. Across more than 40 LLM variants tested, the bias propagated regardless of architecture. The researchers argue that relying on LLM annotations without such diagnostics can silently distort downstream NLP systems, especially for under-resourced languages like Bengali. MultiSoc-4D is released as an open benchmark to help the community identify and mitigate these annotation biases.
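To see how high raw agreement and a near-zero Fleiss' Kappa can coexist, here is a minimal sketch using the standard Fleiss' Kappa formula. The toy ratings are invented for illustration and are not drawn from MultiSoc-4D; the function name `fleiss_kappa` and the 100-item, four-rater setup are assumptions of this example, not the researchers' code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Return (kappa, mean_observed_agreement) for a list of items,
    where each item is a list of labels, one per rater.
    Assumes every item has the same number of raters (at least 2)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Per-item agreement: share of rater pairs that chose the same label.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each label.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]
    p_expected = sum(s * s for s in shares)

    kappa = (p_observed - p_expected) / (1 - p_expected)
    return kappa, p_observed


# Hypothetical toy data: 4 raters, 100 comments, labels "Yes"/"No" for sarcasm.
# 92 comments get the fallback "No" from everyone; on the other 8, exactly one
# rater says "Yes", and the raters never pick the same comment.
ratings = [["No"] * 4 for _ in range(92)]
for rater in range(4):
    for _ in range(2):
        item = ["No"] * 4
        item[rater] = "Yes"
        ratings.append(item)

kappa, raw = fleiss_kappa(ratings)
print(f"raw agreement: {raw:.2f}")   # 0.96
print(f"Fleiss' kappa: {kappa:.3f}")  # -0.020
```

In this toy case the raters agree on 96% of rater pairs, but almost all of that comes from the shared fallback label, so the chance-corrected score is slightly below zero. The figure differs from the reported -0.001 because the data here is synthetic.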
- MultiSoc-4D is a 58K+ Bengali social media benchmark testing ChatGPT, Gemini, Claude, and Grok on closed-set annotation.
- LLMs exhibit 'instruction-induced label collapse,' missing 79% of hate speech and 75% of sarcasm due to fallback label preference.
- Sarcasm detection shows near-zero agreement (Fleiss' Kappa ≈ -0.001), revealing a 'label agreement illusion' that hides poor minority-class detection.
Why It Matters
Exposes hidden bias in LLM annotation for low-resource languages, threatening the reliability of scaled NLP datasets.