Research & Papers

ParsCN: A Persian Dataset for Counter-Narrative Generation to Combat Online Hate Speech

A new 1,100-pair dataset, built with an LLM-augmented annotation framework, provides culturally aware counter-narratives for Persian.

Deep Dive

Researchers have introduced ParsCN, the first and most comprehensive Persian dataset designed to train AI models to generate counter-narratives against online hate speech. The dataset, created by Zahra Safdari Fesaghandis and Suman Kalyan Maity, consists of 1,100 carefully curated hate speech and counter-narrative pairs. The pairs are annotated by target group and countering strategy, covering six target groups and six strategies tailored to the socio-cultural nuances of Persian online discourse. This addresses a critical gap: low-resource languages like Persian have historically lacked the high-quality data needed for effective automated moderation tools.

To build ParsCN efficiently, the team developed a scalable framework that blends culturally informed human annotation with few-shot LLM-augmented generation, using models like GPT-4o and Claude. This hybrid approach, guided by semantic retrieval and rigorous manual curation, produced diverse, high-quality responses while significantly reducing traditional annotation costs, and it offers a replicable model for other low-resource language settings. In evaluations, human-written counter-narratives scored highest, with GPT-4o and Claude following closely on metrics like relevance and tone appropriateness.
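The retrieval-guided few-shot step described above can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `SeedPair` schema, the strategy labels, the placeholder texts, and the prompt wording are all hypothetical, and a toy bag-of-words cosine similarity stands in for whatever semantic retrieval model the paper actually uses. The sketch only shows the general pattern of picking the most similar human-annotated pairs and placing them in a prompt as few-shot examples.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SeedPair:
    """One human-annotated example (hypothetical schema, not the ParsCN format)."""
    hate_speech: str
    counter_narrative: str
    target_group: str
    strategy: str


def embed(text: str) -> Counter:
    # Toy bag-of-words vector. A real pipeline would call a sentence-embedding model.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, seeds: list[SeedPair], k: int = 2) -> list[SeedPair]:
    # Rank seed pairs by similarity of their hate-speech text to the query.
    q = embed(query)
    ranked = sorted(seeds, key=lambda s: cosine(q, embed(s.hate_speech)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, strategy: str, shots: list[SeedPair]) -> str:
    # Assemble a few-shot prompt; the wording here is an illustrative assumption.
    lines = [f"Write a counter-narrative using the '{strategy}' strategy.", ""]
    for s in shots:
        lines += [f"Hate speech: {s.hate_speech}",
                  f"Counter-narrative: {s.counter_narrative}", ""]
    lines += [f"Hate speech: {query}", "Counter-narrative:"]
    return "\n".join(lines)


# Placeholder seed data; group and strategy names are invented for illustration.
SEEDS = [
    SeedPair("example post blaming group_a for job losses",
             "placeholder reply citing employment facts", "group_a", "facts"),
    SeedPair("example post mocking group_b customs",
             "placeholder reply showing empathy", "group_b", "empathy"),
    SeedPair("example post blaming group_a for crime",
             "placeholder reply questioning the claim", "group_a", "questioning"),
]

if __name__ == "__main__":
    query = "new post blaming group_a for job shortages"
    shots = retrieve(query, SEEDS, k=2)
    print(build_prompt(query, "facts", shots))
```

In a workflow like the one the article describes, the resulting prompt would be sent to an LLM and the generated response would then go through manual curation before entering the dataset.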

Benchmark tests on models like mBART and PersianMind revealed that existing systems struggle with fluency, cultural nuance, and safety when generating Persian counter-narratives, underscoring the necessity of language-specific resources like ParsCN. By serving as a foundational benchmark, ParsCN aims to advance research in Persian NLP and foster safer, more inclusive digital environments. The work has been accepted for presentation at the International AAAI Conference on Web and Social Media (ICWSM 2026).

Key Points
  • ParsCN is the first Persian dataset for counter-narrative generation, containing 1,100 hate speech/counter-narrative pairs annotated across six target groups and six countering strategies.
  • The dataset was built with a hybrid framework that combines human annotation with few-shot LLM-augmented generation (GPT-4o, Claude), reducing annotation costs and offering a template for other low-resource languages.
  • Benchmark tests show existing models (mBART, PersianMind) struggle with cultural nuance, highlighting the need for this Persian-specific resource to build effective moderation AI.

Why It Matters

It provides a culturally aware dataset for building AI tools that can combat hate speech in Persian, a major language that remains low-resource online.