Towards Operational Validation of LLM-Agent Social Simulations: A Replicated Study of a Reddit-like Technology Forum
30 independent simulations of a tech forum reveal LLM agents mirror real users' activity patterns but diverge on thread length, toxicity, and network structure.
A team of researchers from Serbia conducted a rigorous validation of LLM-agent social simulations by running 30 independent 30-day simulations of a technology forum modeled on Voat's v/technology. They used stateless Dolphin Mistral 24B agents on the Y Social platform and evaluated operational validity across five dimensions: activity patterns, network structure, toxicity, topical coverage, and stylistic convergence. Against 30 matched, non-overlapping 30-day Voat comparison windows, results showed overlapping 99% confidence intervals for unique users, root posts, and daily active users, indicating that LLM agents can reproduce key metrics of real online communities.
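The comparison at the heart of the study, checking whether 99% confidence intervals from 30 simulation runs and 30 Voat windows overlap, is straightforward to reproduce. The sketch below is a minimal illustration, not the paper's code: the metric name and the synthetic arrays are placeholders.

```python
import numpy as np
from scipy import stats

def ci99(samples: np.ndarray) -> tuple[float, float]:
    """Two-sided 99% t-based confidence interval for the mean of `samples`."""
    n = len(samples)
    mean = samples.mean()
    sem = samples.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    t_crit = stats.t.ppf(0.995, df=n - 1)    # 99% two-sided -> 0.995 quantile
    return mean - t_crit * sem, mean + t_crit * sem

def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical per-window metric values (e.g., daily active users averaged
# over each 30-day window); the study uses 30 simulation runs and 30
# non-overlapping Voat comparison windows.
rng = np.random.default_rng(0)
sim_dau = rng.normal(120, 10, size=30)
voat_dau = rng.normal(115, 12, size=30)

print(intervals_overlap(ci99(sim_dau), ci99(voat_dau)))  # True -> metrics match
```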
However, systematic divergences emerged. Simulated comment volume, average thread length, and mean toxicity all ran higher than in the real-world data. Both simulated and empirical networks exhibited core-periphery structure, but the simulated cores were larger and more diffuse, with less frequent repeated interactions. Topic alignment was near-complete, yet toxicity was misallocated across content layers: simulated root posts were substantially more toxic than real submissions, while simulated comments were less toxic than Voat comments. These findings demonstrate that LLM agents in platform-faithful environments can reproduce familiar online regularities, and they point to concrete directions for improvement, particularly stateless agent design and content-layer calibration.
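The core-periphery comparison can be approximated with a k-core decomposition, used here as a simple proxy since this summary does not specify the paper's exact detection method. A minimal networkx sketch on a toy interaction graph:

```python
import networkx as nx

# Toy interaction graph; in the study, nodes are users and edges are reply
# interactions. Repeated interactions show up here as higher edge weights.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 5), ("a", "c", 3), ("b", "c", 4),   # densely connected core
    ("a", "d", 1), ("b", "e", 1), ("c", "f", 1),   # one-off periphery contacts
])

# k-core decomposition: core_number[v] is the largest k such that v belongs
# to a subgraph in which every node has degree >= k.
core_number = nx.core_number(G)
max_k = max(core_number.values())
core = {v for v, k in core_number.items() if k == max_k}

print(f"core ({len(core)}/{G.number_of_nodes()} nodes):", sorted(core))

# A "larger, more diffuse" simulated core would show up as a bigger core set
# with lower mean edge weight (fewer repeated interactions) inside it.
core_edges = [d["weight"] for u, v, d in G.edges(data=True) if u in core and v in core]
print("mean repeated-interaction weight in core:", sum(core_edges) / len(core_edges))
```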
- 30 independent 30-day simulations using Dolphin Mistral 24B agents on Y Social platform showed overlapping 99% confidence intervals for unique users, root posts, and daily active users versus real Voat data
- Simulated networks exhibited core-periphery structure but with larger, more diffuse cores and less frequent repeated interactions than real-world forums
- Toxicity misallocation: simulated root posts were substantially more toxic than real submissions, while simulated comments were less toxic than Voat comments (a sketch of this per-layer comparison follows the list)
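To make the misallocation finding concrete, the hypothetical sketch below shows the per-layer comparison, assuming each post already carries a toxicity score (e.g., from a classifier). The column names and values are illustrative, not taken from the paper.

```python
import pandas as pd

# Hypothetical pre-scored content: `layer` separates root posts from comments,
# `source` separates simulation output from the Voat baseline.
posts = pd.DataFrame({
    "source":   ["sim", "sim", "sim", "sim", "voat", "voat", "voat", "voat"],
    "layer":    ["root", "root", "comment", "comment"] * 2,
    "toxicity": [0.45, 0.51, 0.12, 0.15, 0.20, 0.18, 0.30, 0.34],
})

# Mean toxicity per (source, layer): the misallocation pattern is sim roots
# scoring above Voat submissions while sim comments score below Voat comments.
print(posts.groupby(["source", "layer"])["toxicity"].mean().unstack("layer"))
```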
Why It Matters
Validates LLM agents as instruments for social simulation research and pinpoints the precise divergences, such as stateless agent design and content-layer toxicity calibration, that must be fixed to build more realistic digital twins of online communities.