AI Safety

LLMs like GPT-4.1 can generate synthetic survey data for populations

Zero-shot LLM data matches real surveys on state-level health contrasts...

Deep Dive

A team from George Mason University evaluated whether LLMs can replace expensive survey data collection for population synthesis. They used GPT-4.1 and Gemini-2.5-Pro to generate synthetic Behavioral Risk Factor Surveillance System (BRFSS) health survey records for Colorado and Mississippi, two states with distinct demographic and health profiles. The zero-shot prompts asked the models to produce individual-level survey responses without any prior fine-tuning. These synthetic records were then fed into a standard iterative proportional fitting (IPF) workflow to create geographically explicit synthetic populations at the census-tract level. Both LLMs captured several major state-level contrasts, indicating that they can produce geographically differentiated data. However, performance depended strongly on the variable: some health indicators, such as diabetes prevalence, were replicated well, while others (e.g., mental health days) showed larger errors. The IPF step sometimes amplified these errors and sometimes reduced them, so downstream accuracy was mixed. Spatial validation showed that LLM-based populations reproduced census-tract patterns reasonably well for variables where the LLM output was already close to ground-truth data. Overall, the researchers conclude that LLM-generated survey data is promising as a supplementary input but not yet a replacement for real survey data. The paper, published on arXiv (2605.27401), contributes to growing interest in using AI to generate synthetic data for public health, epidemiology, and urban planning.
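For readers unfamiliar with IPF, the sketch below shows the core idea on a toy two-way table: a seed cross-tabulation (here standing in for counts derived from survey records) is rescaled, alternating between rows and columns, until its margins match known totals for a small area. This is a minimal illustration, not the paper's pipeline; the function name `ipf`, the age-by-sex layout, and all numbers are hypothetical.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-8):
    """Rescale a 2-D seed table until its margins match the targets.

    Assumes every seed cell is positive and both target vectors
    sum to the same total (otherwise IPF cannot converge).
    """
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Scale each row so the row sums match the row targets.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Scale each column so the column sums match the column targets.
        table *= (col_targets / table.sum(axis=0))[None, :]
        # Columns are now exact; stop once the rows are close enough too.
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table

# Hypothetical seed: survey respondents cross-tabulated by
# age group (rows) and sex (columns).
seed = np.array([[40.0, 35.0],
                 [30.0, 45.0],
                 [20.0, 30.0]])

# Hypothetical census margins for one tract (both sum to 1000).
age_totals = np.array([500.0, 300.0, 200.0])
sex_totals = np.array([480.0, 520.0])

fitted = ipf(seed, age_totals, sex_totals)
print(fitted.round(1))
```

Because IPF only rescales the seed, the associations between variables inside the seed carry through to the fitted table. That is one way to see why errors in LLM-generated records can survive, shrink, or grow downstream depending on how they interact with the census margins.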

Key Points
  • GPT-4.1 and Gemini-2.5-Pro generated zero-shot synthetic health survey data that captured state-level differences between Colorado and Mississippi (e.g., obesity rates).
  • Downstream IPF-based population synthesis showed variable-dependent accuracy: some census-tract patterns matched well, but errors in LLM data were not always corrected.
  • LLM-generated survey data is promising as a supplement but cannot yet replace real surveys for geographically explicit population synthesis.

Why It Matters

LLM-generated synthetic data could sharply cut the cost of population surveys for public health and urban planning, but its accuracy still varies by variable, so it should be used with caution.