PerSoMed: A Large-Scale Balanced Dataset for Persian Social Media Text Classification
A new dataset built with ChatGPT-assisted annotation lets models reach a 96% F1-score on Persian social media text.
Researchers Isun Chehreh and Ebrahim Ansari have introduced PerSoMed, the first large-scale balanced dataset designed specifically for Persian social media text classification, addressing a significant gap in Persian natural language processing resources. The dataset comprises 36,000 curated posts split evenly across nine categories (Economic, Artistic, Sports, Political, Social, Health, Psychological, Historical, and Science & Technology), with 4,000 samples per category.
The creation process began with 60,000 raw posts collected from various Persian social media platforms, followed by rigorous preprocessing and a hybrid annotation approach that combines ChatGPT-based few-shot prompting with human verification. To correct class imbalance, the team undersampled larger categories while removing semantically redundant posts, and augmented smaller categories using lexical replacement and generative prompting.
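The undersampling step with redundancy removal can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: token-level Jaccard overlap stands in for whatever semantic similarity measure they used, the 0.8 threshold and the function names are hypothetical, and the English posts are placeholders for Persian text.

```python
import random
from collections import defaultdict


def tokens(text):
    """Lowercased whitespace tokens; a crude stand-in for a semantic representation."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(texts, threshold=0.8):
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_tokens = [], []
    for text in texts:
        tok = tokens(text)
        if all(jaccard(tok, other) < threshold for other in kept_tokens):
            kept.append(text)
            kept_tokens.append(tok)
    return kept


def balance(posts, per_class, seed=0):
    """posts: iterable of (text, label). Returns at most per_class texts per label."""
    by_label = defaultdict(list)
    for text, label in posts:
        by_label[label].append(text)

    rng = random.Random(seed)
    balanced = {}
    for label, texts in by_label.items():
        unique = drop_near_duplicates(texts)  # remove redundancy first
        rng.shuffle(unique)                   # then undersample at random
        balanced[label] = unique[:per_class]
    return balanced


posts = [
    ("the central bank raised interest rates today", "Economic"),
    ("the central bank raised interest rates again today", "Economic"),
    ("inflation slowed for the third month", "Economic"),
    ("stock market closed higher", "Economic"),
    ("the national team won the final match", "Sports"),
    ("a new stadium opens next year", "Sports"),
]
balanced = balance(posts, per_class=2)
print({label: len(texts) for label, texts in balanced.items()})
# prints {'Economic': 2, 'Sports': 2}
```

Removing near-duplicates before sampling means the cap is filled with distinct posts rather than repeats of the same content. A real pipeline would likely compare sentence embeddings instead of token sets.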
Benchmarking results show that transformer-based models consistently outperform traditional neural networks such as BiLSTM. The Persian-specific TookaBERT-Large model achieved the highest performance, with a precision of 0.9622, a recall of 0.9621, and an F1-score of 0.9621. Models performed robustly across all categories, although social and political texts scored slightly lower because of the inherent ambiguity of these domains. The dataset is publicly available, providing a foundation for Persian NLP applications including social media trend analysis, user behavior modeling, and content classification systems.
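For reference, precision, recall, and F1 over a multi-class label set can be computed as in the sketch below. The summary does not state which averaging scheme the reported figures use, so macro averaging here is an assumption, and the labels are toy data rather than PerSoMed results.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_prf(y_true, y_pred):
    """Unweighted mean of the per-class scores (macro averaging, assumed here)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [per_class_prf(y_true, y_pred, label) for label in labels]
    return tuple(sum(s[i] for s in scores) / len(labels) for i in range(3))


# Toy labels: Social and Political are confused with each other once each.
y_true = ["Sports", "Sports", "Political", "Political", "Social", "Social"]
y_pred = ["Sports", "Sports", "Political", "Social", "Social", "Political"]
precision, recall, f1 = macro_prf(y_true, y_pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints precision=0.6667 recall=0.6667 f1=0.6667
```

In the toy data, the Sports class scores 1.0 while the two confused classes score 0.5 each, which illustrates how ambiguity between a pair of categories pulls the macro average down.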
- First balanced Persian social media dataset with 36,000 posts across 9 categories (4,000 each)
- Hybrid annotation uses ChatGPT few-shot prompting + human verification on 60,000 raw posts
- TookaBERT-Large achieves an F1-score of 0.962, outperforming BiLSTM and other transformer models
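The hybrid annotation idea in the points above, few-shot prompting followed by human checking, might look something like the sketch below. The prompt wording, the function names, and the rule for routing answers to a human are all hypothetical, since the summary does not describe them. The model call itself is left out, and the example posts are English placeholders.

```python
CATEGORIES = [
    "Economic", "Artistic", "Sports", "Political", "Social",
    "Health", "Psychological", "Historical", "Science & Technology",
]


def build_few_shot_prompt(post, examples):
    """Build a prompt from labeled demonstrations; examples is a list of (text, category)."""
    lines = [
        "Classify the Persian social media post into exactly one category.",
        "Categories: " + ", ".join(CATEGORIES),
        "",
    ]
    for text, category in examples:
        lines += [f"Post: {text}", f"Category: {category}", ""]
    # The final entry leaves the category blank for the model to complete.
    lines += [f"Post: {post}", "Category:"]
    return "\n".join(lines)


def needs_human_review(model_answer):
    """Hypothetical rule: send to a human when the answer is not one of the nine labels."""
    return model_answer.strip() not in CATEGORIES


examples = [
    ("The national team won the final match", "Sports"),
    ("The central bank raised interest rates", "Economic"),
]
prompt = build_few_shot_prompt("A new vaccine campaign starts next week", examples)
print(prompt)
```

The resulting string would be sent to the language model, and its reply passed to `needs_human_review`. In practice a human pass would probably also sample valid-looking answers, not only malformed ones.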
Why It Matters
Enables accurate Persian social media analysis for 110M+ Persian speakers, supporting content moderation and trend detection.