Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
First comprehensive benchmark of 7 LLMs across authorship, generation, and inference...
A new arXiv paper presents the first comprehensive evaluation of modern LLMs (GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT) across three core social media analytics tasks on a Twitter (X) dataset. For authorship verification, the team introduced a systematic sampling framework spanning diverse user and post selection strategies, and evaluated generalization on newly collected tweets from January 2024 onward to mitigate seen-data bias. For post generation, they assessed how well LLMs produce authentic, user-like content under comprehensive evaluation metrics, and bridged the two tasks with a user study measuring how real users perceive LLM-generated posts conditioned on their own writing.
For user attribute inference, the researchers annotated occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC), benchmarking LLMs against existing baselines. The study provides new insights into how well these models handle social media analytics tasks, establishing reproducible benchmarks for the field. The code and data are provided in the supplementary material and will be made publicly available upon publication.
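The seen-data-bias mitigation described above can be illustrated with a small sketch: restrict evaluation pairs to posts written after a cutoff date (January 2024), so the test set post-dates the models' likely training data. The function and data layout here are hypothetical, not the paper's actual framework.

```python
from datetime import datetime
import random

# Hypothetical sketch of one sampling strategy for authorship verification:
# keep only posts written after a cutoff date so the evaluation set
# post-dates the LLMs' training data (mitigating seen-data bias).

CUTOFF = datetime(2024, 1, 1)

def sample_verification_pairs(posts_by_user, n_pairs, seed=0):
    """Build (post_a, post_b, same_author) pairs from post-cutoff tweets.

    posts_by_user: dict mapping user id -> list of (timestamp, text).
    Returns labelled pairs, balanced between same- and cross-author cases.
    """
    rng = random.Random(seed)
    # Keep only post-cutoff posts, and only users with at least two of them.
    fresh = {u: [text for ts, text in ps if ts >= CUTOFF]
             for u, ps in posts_by_user.items()}
    fresh = {u: ps for u, ps in fresh.items() if len(ps) >= 2}
    users = list(fresh)
    pairs = []
    for i in range(n_pairs):
        if i % 2 == 0:  # same-author pair
            u = rng.choice(users)
            a, b = rng.sample(fresh[u], 2)
            pairs.append((a, b, True))
        else:           # cross-author pair
            u, v = rng.sample(users, 2)
            pairs.append((rng.choice(fresh[u]), rng.choice(fresh[v]), False))
    return pairs
```

Varying which users and which posts are eligible (e.g. by activity level or post length) yields the "diverse user and post selection strategies" the paper sweeps over.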
- Evaluated 7 LLMs including GPT-4o, Gemini 1.5 Pro, DeepSeek-V3, and Llama 3.2 on 3 social media tasks
- Introduced a systematic sampling framework to mitigate seen-data bias using tweets from Jan 2024 onward
- Used IAB Tech Lab 2023 and 2018 U.S. SOC taxonomies for standardized occupation/interest annotation
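Standardized taxonomies make attribute inference a closed-set task: a free-form LLM generation must be mapped onto one of the taxonomy's labels. A minimal sketch of that mapping step, using fuzzy string matching and an illustrative label list (not the actual IAB or SOC taxonomy):

```python
import difflib

# Illustrative labels only; the paper uses the IAB Tech Lab 2023 and
# 2018 U.S. SOC taxonomies, which are much larger.
TAXONOMY = ["Healthcare Practitioners", "Software Developers",
            "Legal Occupations", "Education and Training"]

def snap_to_taxonomy(prediction, labels=TAXONOMY, cutoff=0.4):
    """Map a free-form prediction to the closest taxonomy label.

    Returns None when nothing in the taxonomy is a plausible match,
    so off-taxonomy generations can be scored as errors.
    """
    lowered = [label.lower() for label in labels]
    matches = difflib.get_close_matches(prediction.lower(), lowered,
                                        n=1, cutoff=cutoff)
    if not matches:
        return None
    return labels[lowered.index(matches[0])]
```

Constraining outputs this way is what makes results comparable across LLMs and against existing baselines.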
Why It Matters
Establishes reproducible benchmarks for LLM-driven social media analytics, crucial for content moderation and user understanding.