LLMs like GPT-4.1 can generate synthetic survey data for populations
Zero-shot LLM data matches real surveys on state-level health contrasts...
A team from George Mason University evaluated whether LLMs can replace expensive survey data collection for population synthesis. They used GPT-4.1 and Gemini-2.5-Pro to generate synthetic Behavioral Risk Factor Surveillance System (BRFSS) health survey records for Colorado and Mississippi, two states with distinct demographic and health profiles. The zero-shot prompts asked the models to produce individual-level survey responses without any fine-tuning. The synthetic records were then fed into a standard iterative proportional fitting (IPF) workflow to create geographically explicit synthetic populations at the census-tract level.

Both LLMs captured several major state-level contrasts, which indicates they can produce geographically differentiated data. Performance, however, depended strongly on the variable: some health indicators, such as diabetes prevalence, were replicated well, while others (e.g., mental health days) showed larger errors. The IPF step sometimes amplified these errors and sometimes reduced them, so downstream accuracy was mixed. Spatial validation showed that LLM-based populations reproduced census-tract patterns reasonably well for the variables whose LLM-generated values were closer to the ground-truth data.

The researchers conclude that LLM-generated survey data is promising as a supplementary input but not yet a replacement for real survey data. The paper, published on arXiv (2605.27401), contributes to growing interest in using AI to generate synthetic data for public health, epidemiology, and urban planning.
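For readers unfamiliar with IPF, the sketch below shows the general technique on a tiny two-way table. It is not the paper's code: the function name `ipf`, the seed counts, and the tract totals are all hypothetical and chosen only for illustration. The seed table plays the role of the survey-derived joint distribution (here, the part an LLM would supply), and the targets play the role of known census-tract margins.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D seed table until its margins match the targets.

    Generic textbook IPF, not the authors' implementation.
    """
    table = [row[:] for row in seed]  # copy so the seed is not mutated
    for _ in range(max_iter):
        # Row step: rescale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [cell * target / total for cell in table[i]]
        # Column step: rescale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows also match.
        worst = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if worst < tol:
            break
    return table


# Hypothetical seed: age group (rows) by diabetes status (columns).
seed = [[30.0, 10.0], [20.0, 40.0]]
row_targets = [600.0, 400.0]  # hypothetical tract totals by age group
col_targets = [700.0, 300.0]  # hypothetical tract totals by diabetes status
fitted = ipf(seed, row_targets, col_targets)
```

IPF only rescales rows and columns, so the association between variables (the odds ratio of the table) is inherited from the seed. That is one plausible reading of why errors in LLM-generated records can carry through to the tract-level populations even when the margins are matched exactly.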
- GPT-4.1 and Gemini-2.5-Pro generated zero-shot synthetic health survey data that captured state-level differences between Colorado and Mississippi (e.g., obesity rates).
- Downstream IPF-based population synthesis showed variable-dependent accuracy: some census-tract patterns matched well, but errors in LLM data were not always corrected.
- LLM-generated survey data is promising as a supplement but cannot yet replace real surveys for geographically explicit population synthesis.
Why It Matters
LLM-generated synthetic data could drastically cut the cost of population surveys for public health and urban planning, but its current accuracy limits call for cautious use.