Research & Papers

Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets

New agentic AI workflow generates realistic disaster tweets for model training, working around X's restrictive data access.

Deep Dive

A team of researchers has introduced a novel solution to a growing problem in crisis informatics: the scarcity of real social media data for training AI. Due to restrictive API changes at X (formerly Twitter), it's become increasingly difficult for researchers to access and curate real-world tweet datasets from events like earthquakes or floods. The team's new agentic workflow, detailed in an arXiv preprint, uses AI agents to systematically generate synthetic tweets that mimic the characteristics of real crisis communications.

The workflow operates through an iterative loop. First, it generates synthetic tweets conditioned on specific target labels, such as a geographic location and a level of damage. These tweets are then evaluated by predefined compliance checks. The system incorporates structured feedback from these evaluations to refine the tweets in subsequent iterations, progressively improving their quality and relevance. In a case study focused on post-earthquake scenarios, the workflow successfully generated datasets that captured the target labels for location and damage level.
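
The preprint's implementation details aren't reproduced here, so the following is a minimal sketch of such a generate-evaluate-refine loop under stated assumptions: `TargetLabels`, `compliance_checks`, and the injected `generate` callable are hypothetical names, and the rule-based checks are illustrative stand-ins for whatever validators the authors actually use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical target-label spec; field names are illustrative, not from the paper.
@dataclass
class TargetLabels:
    location: str       # e.g. "Napa, CA"
    damage_level: str   # e.g. "severe", "moderate", "none"

def compliance_checks(tweet: str, labels: TargetLabels) -> list[str]:
    """Return structured feedback: one message per failed check."""
    feedback = []
    if len(tweet) > 280:
        feedback.append("Tweet exceeds 280 characters; shorten it.")
    if labels.location.split(",")[0].lower() not in tweet.lower():
        feedback.append(f"Tweet must reference the location '{labels.location}'.")
    if labels.damage_level.lower() not in tweet.lower():
        feedback.append(f"Tweet must reflect damage level '{labels.damage_level}'.")
    return feedback

def generate_tweet(generate: Callable[[TargetLabels, list[str]], str],
                   labels: TargetLabels, max_iters: int = 5) -> str:
    """Iterate: generate conditioned on labels + feedback until checks pass."""
    feedback: list[str] = []
    tweet = ""
    for _ in range(max_iters):
        tweet = generate(labels, feedback)       # agent call, feedback in prompt
        feedback = compliance_checks(tweet, labels)
        if not feedback:                         # all compliance checks passed
            break
    return tweet
```

In a real system, `generate` would be an LLM call that folds the feedback messages back into the prompt, and the compliance layer could combine simple rule checks like these with model-based evaluators.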

Crucially, the researchers demonstrated that these synthetic datasets are not just plausible text; they are functional for AI development. The synthetic tweets were used to train and evaluate AI systems on critical crisis informatics tasks, specifically geolocalization (pinpointing where a tweet originated) and damage level prediction. The results indicate that this agentic approach offers a flexible and scalable alternative to the costly and limited process of curating real-world data, enabling the development of AI tools for diverse crisis events and societal contexts.
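
To make the downstream use concrete, here is a minimal, hypothetical baseline for the damage-level prediction task: the tweets and labels below are invented stand-ins for a workflow-generated dataset, and the TF-IDF plus logistic regression model is a deliberately simple choice, not the classifier the authors evaluated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented stand-ins for synthetic (tweet, damage_level) pairs from the workflow.
tweets = [
    "Collapsed wall on Main St, people trapped, send help",
    "Shaking woke us up but no visible damage on our block",
    "Cracks all over the overpass, roads closed downtown",
    "Power flickered for a second, otherwise a normal night",
]
damage_levels = ["severe", "none", "moderate", "none"]

# Simple bag-of-words baseline: vectorize the text, fit a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets, damage_levels)

# Predict the damage level of an unseen (also invented) tweet.
print(model.predict(["Bridge collapsed near the river, traffic blocked"]))
```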

Key Points
  • The workflow uses an iterative, agentic loop to generate synthetic tweets, evaluate them against predefined compliance checks, and refine them toward target labels such as geographic location and damage level.
  • In a post-earthquake case study, the system created datasets usable for training AI models on geolocalization and damage level prediction tasks.
  • It directly addresses the problem of X's restrictive data access policies, which have severely limited the curation of real-world crisis tweet datasets for research.

Why It Matters

Enables continued AI research for disaster response by creating realistic training data, circumventing restrictive social media platform policies.