WhaVax dataset targets vaccine misinformation on WhatsApp with expert annotations
10 pages of expert-annotated WhatsApp vaccine messages with benchmark results...
Researchers from multiple Brazilian institutions (including UFMG) have released WhaVax, a rigorously curated, expert-annotated dataset of vaccine-related WhatsApp messages collected from large Brazilian public groups over several pandemic years. The pipeline combines keyword-based collection, semantic deduplication to eliminate near-duplicates, and a multi-stage annotation protocol executed by medical specialists. The resulting gold-standard corpus shows substantial inter-annotator agreement and enables reliable downstream analysis of health misinformation in encrypted messaging environments. The dataset also characterizes WhatsApp misinformation through linguistic, structural, lexical, temporal, and group-level patterns, including a layer of ambiguous cases that reflect the complexity of real-world health discourse.
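The collection and deduplication steps described above can be sketched roughly as follows. This is a minimal illustration, not the WhaVax implementation: the keyword list, the sample messages, and the 0.8 threshold are hypothetical, and token-level Jaccard similarity stands in for the semantic (embedding-based) deduplication the authors describe, only to keep the sketch dependency-free.

```python
import re

# Hypothetical keyword list; the actual WhaVax keywords are not given in this summary.
KEYWORDS = {"vacina", "vacinas", "vacinação", "imunização"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def matches_keywords(text, keywords=KEYWORDS):
    """Keyword-based collection: keep a message if any token is a keyword."""
    return any(token in keywords for token in tokenize(text))


def jaccard(tokens_a, tokens_b):
    """Jaccard similarity between two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(messages, threshold=0.8):
    """Greedy near-duplicate removal: keep a message only if it is
    below the similarity threshold against every message kept so far.
    Assumption: lexical similarity is used here as a simple stand-in
    for semantic similarity computed from sentence embeddings."""
    kept, kept_tokens = [], []
    for message in messages:
        tokens = tokenize(message)
        if all(jaccard(tokens, previous) < threshold for previous in kept_tokens):
            kept.append(message)
            kept_tokens.append(tokens)
    return kept


# Hypothetical sample messages, written for illustration only.
messages = [
    "A vacina causa efeitos graves, compartilhe!",
    "A vacina causa efeitos graves, compartilhe!!!",
    "Bom dia, grupo!",
    "Campanha de vacinação começa amanhã no posto de saúde.",
]

candidates = [m for m in messages if matches_keywords(m)]
unique = deduplicate(candidates)
```

In a setup closer to the one described, the `jaccard` call would likely be replaced by cosine similarity between sentence embeddings, so that paraphrased forwards are caught as well as messages that differ only in punctuation.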
Benchmarking classical machine learning models, fine-tuned Small Language Models (SLMs), and zero/few-shot Large Language Models (LLMs) under realistic data-scarcity constraints shows that strong embeddings and LLM approaches perform competitively. However, domain alignment (i.e., fine-tuning on in-domain data) and data availability remain critical factors for optimal detection. The study highlights the unique challenges of combating misinformation in private messaging, where content cannot be easily monitored or moderated. WhaVax provides a rare, high-quality resource to support computational modeling and misinformation research, particularly for Portuguese-language content and encrypted platforms like WhatsApp.
- WhaVax comprises vaccine-related WhatsApp messages from large Brazilian public groups spanning multiple pandemic years, annotated by medical specialists with substantial inter-annotator agreement.
- The pipeline combines keyword-based collection, semantic deduplication, and multi-stage annotation, yielding a gold-standard corpus that is characterized along linguistic, structural, lexical, temporal, and group-level dimensions.
- Benchmarks show strong embeddings and LLMs are competitive for misinformation detection under data scarcity, but domain alignment and data availability remain critical.
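To make the notion of inter-annotator agreement concrete, the sketch below computes Cohen's kappa for two annotators. The summary does not say which agreement statistic the WhaVax authors report, so the choice of Cohen's kappa is an assumption, and the label names and annotations are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations over six messages (labels are illustrative only).
annotator_1 = ["misinfo", "misinfo", "not_misinfo", "not_misinfo", "ambiguous", "misinfo"]
annotator_2 = ["misinfo", "not_misinfo", "not_misinfo", "not_misinfo", "ambiguous", "misinfo"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Under the commonly cited Landis and Koch convention, values between 0.61 and 0.80 are described as "substantial" agreement; whether the authors follow that convention is not stated here.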
Why It Matters
WhaVax provides a rare, high-quality resource for detecting health misinformation in encrypted messaging, a growing challenge for public health.