Research & Papers

Social Media Data Toolkit: Standardization and Anonymization of Social Network Datasets

New open-source Python framework unifies data from Twitter, Reddit, Mastodon, and more.

Deep Dive

The rapid diversification of social media platforms and increasing API restrictions have made cross-platform research notoriously difficult. Researchers often rely on scraped or historical datasets that lack structural consistency, which leaves three questions open in any comparison: how do the platforms differ, how was each dataset collected, and how can the datasets be aligned for fair analysis? To address this, Ali Najafi, Letizia Iannucci, Mikko Kivelä, and Onur Varol developed the Social Media Data Toolkit (SMDT), an open-source Python framework that standardizes heterogeneous social network datasets into a unified schema of five core entities: Communities, Accounts, Posts, Actions, and Entities. The framework includes a configurable anonymization module that strips personally identifiable information (PII) and an extensible enrichment layer that integrates large language models (LLMs) and network analysis tools for downstream tasks such as stance detection and toxicity scoring, all without requiring custom code for each dataset.
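
To make the idea of a unified schema concrete, here is a minimal Python sketch of mapping two differently shaped platform records onto one common Post record. It is not SMDT's actual API: the `Post` dataclass, its field names, and the `from_reddit` and `from_mastodon` helpers are hypothetical, and the input dictionaries are simplified assumptions about what a Reddit-style and a Mastodon-style export might look like.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Post:
    """Hypothetical unified post record; field names are illustrative only."""
    platform: str
    post_id: str
    account_id: str
    community: str | None
    text: str
    created_at: datetime


def from_reddit(raw: dict[str, Any]) -> Post:
    # Assumed Reddit-style shape: epoch seconds in "created_utc", body in "selftext".
    return Post(
        platform="reddit",
        post_id=str(raw["id"]),
        account_id=str(raw["author"]),
        community=raw.get("subreddit"),
        text=raw.get("selftext", ""),
        created_at=datetime.fromtimestamp(raw["created_utc"], tz=timezone.utc),
    )


def from_mastodon(raw: dict[str, Any]) -> Post:
    # Assumed Mastodon-style shape: ISO 8601 timestamp, nested "account" object.
    # A trailing "Z" is rewritten so older Python versions can parse it.
    timestamp = raw["created_at"].replace("Z", "+00:00")
    return Post(
        platform="mastodon",
        post_id=str(raw["id"]),
        account_id=str(raw["account"]["id"]),
        community=None,  # this sketch assigns no community to Mastodon posts
        text=raw.get("content", ""),
        created_at=datetime.fromisoformat(timestamp),
    )
```

Once every source is reduced to the same record type, downstream analysis code can be written once against `Post` rather than once per platform, which is the kind of reuse the toolkit is aiming for.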

The authors demonstrate SMDT through four case studies spanning textual content analysis and network analysis across platforms. The tool is designed for researchers at any skill level, with detailed documentation and practical guides. By providing a common data structure and built-in privacy safeguards, SMDT aims to make social media research more reproducible and scalable, especially as platforms continue to restrict official APIs. The code and documentation are available on GitHub, and the paper is available on arXiv. The toolkit could become a foundational resource for computational social scientists studying misinformation, polarization, and online behavior across multiple platforms.

Key Points
  • Unifies data from diverse platforms into a generic schema of Communities, Accounts, Posts, Actions, and Entities.
  • Configurable anonymization module strips personally identifiable information (PII) to support ethical research.
  • Enrichment layer integrates LLMs and network analysis for stance detection and toxicity scoring without custom code.
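
The anonymization point above can be illustrated with a small Python sketch. This is not SMDT's anonymization module: it assumes one common approach, keyed hashing (HMAC-SHA256) to replace account identifiers with stable pseudonyms, plus regular-expression masking of URLs, email addresses, and @mentions in post text. The function names, placeholder tokens, and patterns are hypothetical and deliberately simple.

```python
import hashlib
import hmac
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MENTION = re.compile(r"(?<!\w)@\w+")


def pseudonymize(identifier: str, key: bytes) -> str:
    """Map an identifier to a stable pseudonym using a secret key."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scrub_text(text: str) -> str:
    """Mask URLs, email addresses, and @mentions, in that order."""
    text = URL.sub("<URL>", text)
    text = EMAIL.sub("<EMAIL>", text)
    return MENTION.sub("<USER>", text)
```

Because the pseudonym depends on a secret key, the same account maps to the same value within one study, so interaction networks stay intact, while anyone without the key cannot recompute the mapping from a list of known usernames.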

Why It Matters

Enables reproducible, cross-platform social media research despite API restrictions and data heterogeneity.