Research & Papers

Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

11 frontier models tested. Only a 7% median pass rate in batch generation.

Deep Dive

A new paper accepted to ACL 2026, titled "Large Language Models Are Bad Dice Players," presents the first large-scale audit of native probabilistic sampling in frontier LLMs. Researchers Minda Zhao, Yilun Du, and Mengyu Wang benchmarked 11 models across 15 statistical distributions using a dual-protocol design: Batch Generation (1,000 samples in one response) and Independent Requests (1,000 stateless calls). The results reveal a sharp asymmetry—batch generation achieved only a 7% median pass rate, while independent requests collapsed almost entirely, with 10 of 11 models failing every distribution. Sampling fidelity also degraded monotonically with distributional complexity and as the sampling horizon N increased.
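The paper's exact pass/fail criterion isn't given in this summary, but a chi-square goodness-of-fit test is one standard way such an audit could flag a skewed batch. The sketch below is illustrative, not the authors' protocol: the `biased` sample mimics an LLM over-producing one value, and all function names are hypothetical.

```python
import random
from collections import Counter

def chi_square_uniform(samples, k):
    """Chi-square goodness-of-fit statistic against a discrete uniform on 1..k."""
    n = len(samples)
    expected = n / k
    counts = Counter(samples)
    return sum((counts.get(i, 0) - expected) ** 2 / expected
               for i in range(1, k + 1))

random.seed(0)
# A genuine uniform sampler (fair six-sided die, 1,000 draws)
fair = [random.randint(1, 6) for _ in range(1000)]
# A skewed sampler that over-represents "1", as a model might
biased = [1] * 400 + [random.randint(2, 6) for _ in range(600)]

# Critical value for df=5 at alpha=0.05 is about 11.07
print(chi_square_uniform(fair, 6))    # small statistic: passes
print(chi_square_uniform(biased, 6))  # large statistic: fails
```

Run over 1,000-sample batches per distribution, a battery of such tests would yield pass rates of the kind the paper reports.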

These failures aren't just academic. The study demonstrates real-world impact: models couldn't enforce uniform answer-position constraints in multiple-choice question generation, and they systematically violated demographic targets in attribute-constrained text-to-image prompt synthesis. The authors conclude that current LLMs lack a functional internal sampler, making external tools necessary for any application requiring statistical guarantees—from Monte Carlo simulations to fairness-aware content generation. The paper highlights a critical blind spot as LLMs move from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence.
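The summary doesn't name the external tools the authors recommend; a minimal sketch of the delegation pattern, where the model only specifies a distribution and a conventional PRNG draws the actual samples, might look like this (all names and the spec format are hypothetical):

```python
import random

# Hypothetical dispatcher: the LLM emits a distribution spec,
# and a conventional PRNG (never the model) produces the samples.
SAMPLERS = {
    "uniform_int": lambda p, rng: rng.randint(p["low"], p["high"]),
    "gaussian":    lambda p, rng: rng.gauss(p["mean"], p["std"]),
    "bernoulli":   lambda p, rng: int(rng.random() < p["p"]),
}

def draw(spec, n, seed=None):
    """Draw n samples for a spec like {"name": ..., "params": {...}}."""
    rng = random.Random(seed)
    sampler = SAMPLERS[spec["name"]]
    return [sampler(spec["params"], rng) for _ in range(n)]

samples = draw({"name": "bernoulli", "params": {"p": 0.3}}, 1000, seed=42)
print(sum(samples) / 1000)  # empirical rate close to 0.3
```

This split keeps the statistical guarantee with the sampler while the model handles only the symbolic task of choosing the distribution.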

Key Points
  • Batch generation achieved only a 7% median pass rate across 11 models and 15 distributions
  • Independent request mode collapsed: 10 of 11 models failed ALL distributions tested
  • Failures propagate to real tasks like biased MCQ generation and skewed demographic prompts
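The MCQ failure suggests a simple workaround in the spirit of the authors' external-tool conclusion: enforce the uniform answer-position constraint outside the model. A hedged sketch with hypothetical names, using Python's PRNG to place the correct option:

```python
import random
from collections import Counter

def place_answer(question, correct, distractors, rng):
    """Shuffle options so the correct answer's position is uniform."""
    options = [correct] + distractors
    rng.shuffle(options)
    return {"question": question,
            "options": options,
            "answer_index": options.index(correct)}

rng = random.Random(7)
positions = [place_answer("q", "right", ["w1", "w2", "w3"], rng)["answer_index"]
             for _ in range(4000)]
# Over 4,000 items, each of the 4 positions should appear roughly 1,000 times
print(Counter(positions))
```

The model still writes the question and distractors; only the position assignment, where the paper found systematic bias, is delegated.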

Why It Matters

LLMs can't reliably generate random numbers, threatening fairness and accuracy in stochastic pipelines.