
Researchers release massive 10-year dataset of 75M German news forum comments

arXiv cs.SI March 11, 2026

A new dataset captures over 75 million user comments and 400 million votes from a major Austrian newspaper's forum.

Deep Dive

A team of computational social scientists has released a large dataset for AI and social network research that captures a decade of public discourse on the Austrian news platform DerStandard. Spanning 2013 to 2022, it contains over 75 million user comments organized into threaded conversations, along with more than 400 million explicit upvotes and downvotes. It also includes editorial topic labels, which can serve as ground truth for content analysis. To protect user privacy, all persistent identifiers are anonymized, and the team provides pre-computed vector representations from a state-of-the-art embedding model instead of raw text.
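To make the structure described above concrete, here is a minimal Python sketch of how threaded comments and signed votes might be represented and summarized. The class names, field names, and sample values are all hypothetical and are not taken from the released dataset, whose actual schema may differ. The sketch also assumes every parent comment is present in the input.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Comment:
    # All field names are hypothetical, not the dataset's real schema.
    comment_id: int
    parent_id: Optional[int]   # None marks a top-level comment
    user_id: str               # anonymized identifier
    embedding: List[float]     # pre-computed vector standing in for raw text


@dataclass
class Vote:
    comment_id: int
    user_id: str
    sign: int                  # +1 for an upvote, -1 for a downvote


def thread_depths(comments: List[Comment]) -> Dict[int, int]:
    """Map each comment id to its reply depth (top-level comments are 0)."""
    parent = {c.comment_id: c.parent_id for c in comments}
    depths: Dict[int, int] = {}
    for cid in parent:
        # Walk up the reply chain until a root or an already-solved comment.
        chain = []
        cur = cid
        while cur is not None and cur not in depths:
            chain.append(cur)
            cur = parent[cur]
        base = -1 if cur is None else depths[cur]
        # Assign depths back down the chain, from ancestor to descendant.
        for node in reversed(chain):
            base += 1
            depths[node] = base
    return depths


def net_scores(votes: List[Vote]) -> Dict[int, int]:
    """Sum signed votes per comment (upvotes minus downvotes)."""
    scores: Dict[int, int] = defaultdict(int)
    for v in votes:
        scores[v.comment_id] += v.sign
    return dict(scores)


# Toy data, invented purely for illustration.
comments = [
    Comment(1, None, "u1", [0.1, 0.2]),
    Comment(2, 1, "u2", [0.0, 0.3]),
    Comment(3, 2, "u1", [0.4, 0.1]),
]
votes = [Vote(1, "u2", +1), Vote(2, "u1", -1), Vote(2, "u3", -1)]

print(thread_depths(comments))  # {1: 0, 2: 1, 3: 2}
print(net_scores(votes))        # {1: 1, 2: -2}
```

Reply depth and net score are two of the simplest per-comment quantities that can be derived from threads and signed votes. Analyses at the scale reported here would likely need a columnar or graph-based representation rather than Python objects.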

The resource is a significant step for research in mid-resourced languages like German, for which far less data of this kind is available than for English. It enables longitudinal studies of how online conversations evolve, how voting patterns correlate with network structure and polarization, and how topics drive user engagement. By providing structured, anonymized data with semantic embeddings, it lowers the barrier for researchers in computational social science, NLP, and network analysis to build and test models of human behavior while limiting privacy risks. The dataset could become a standard benchmark for studying the health of digital public spheres.

Key Points

Contains 75 million comments and 400 million signed votes from a 10-year period (2013-2022).
Data is anonymized and includes pre-computed text embeddings, not raw text, to preserve user privacy.
Provides threaded conversations, explicit up/downvotes, and topic labels for studying discourse dynamics in German.

Why It Matters

Provides a massive, privacy-safe benchmark for training AI models to understand online discourse, polarization, and community dynamics in non-English languages.
