Research & Papers

LLM diversity collapse traced to calibration bottlenecks in 14 models

Why your AI keeps saying the same thing: it's a calibration problem, not just a sampling one.

Deep Dive

A new paper from authors at USC, including Amin Banayeeanzade and Robin Jia, introduces a validity–diversity framework to explain why large language models produce repetitive outputs even under high-temperature sampling. The team studied 14 models across multiple families and scales and found that the core bottleneck is calibration, not just the limitations of sampling heuristics. They identify two complementary failures. Order calibration fails when valid tokens are not reliably ranked above invalid ones, so any rank-based cutoff must trade validity against diversity. Shape calibration fails when probability mass is overly concentrated on a few valid continuations while a heavy tail mixes valid and invalid tokens, so keeping validity high suppresses variety.
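The two failure modes can be illustrated on a single decoding step. The sketch below is not the paper's code: the token names, probabilities, valid set, and function names are all hypothetical, chosen only to show how a top-k cutoff behaves when an invalid token outranks a valid one (order) and when valid mass is lopsided (shape).

```python
import math

# Hypothetical toy next-token distribution; tokens and probabilities are made up.
PROBS = {"a": 0.50, "x": 0.20, "b": 0.15, "y": 0.10, "c": 0.05}
VALID = {"a", "b", "c"}  # assumed known valid set, as in a controlled diagnostic


def topk_tradeoff(probs, valid, k):
    """Return (validity, valid_kept) after a top-k cutoff.

    validity   = share of the kept probability mass that lies on valid tokens
    valid_kept = number of distinct valid tokens that survive the cutoff
    """
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    kept_mass = sum(probs[t] for t in kept)
    valid_mass = sum(probs[t] for t in kept if t in valid)
    return valid_mass / kept_mass, sum(t in valid for t in kept)


def effective_valid_choices(probs, valid):
    """exp(entropy) of the distribution renormalized over valid tokens only."""
    total = sum(probs[t] for t in valid)
    entropy = -sum((probs[t] / total) * math.log(probs[t] / total) for t in valid)
    return math.exp(entropy)


if __name__ == "__main__":
    for k in range(1, len(PROBS) + 1):
        validity, valid_kept = topk_tradeoff(PROBS, VALID, k)
        print(f"k={k}: validity={validity:.2f}, valid tokens kept={valid_kept}")
    choices = effective_valid_choices(PROBS, VALID)
    print(f"effective valid choices: {choices:.2f} of {len(VALID)}")
```

In this toy case only k=1 keeps validity at 1.0, and reaching all three valid tokens requires admitting both invalid ones. Even a perfect cutoff that kept exactly the valid set would offer a little over two effective choices out of three, because most of the valid mass sits on one token.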

These local failures compound across decoding steps, leading to strong sequence-level diversity losses. The researchers developed controlled diagnostics with known valid sets and oracle cutoff baselines to probe these bottlenecks. Their results show that diversity collapse is not merely a limitation of particular sampling heuristics like top-k or top-p, but a fundamental consequence of how LLMs allocate probability mass during inference. This work provides a concrete lens for improving LLM diversity by targeting calibration directly, with implications for creative generation, scientific discovery, and any application requiring varied outputs.
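To see why small per-step losses add up, consider a deliberately simplified model that is an assumption of this sketch rather than the paper's analysis: every decoding step independently retains the same fraction of its valid options, so the fraction of valid sequences still reachable is that fraction raised to the number of steps. The function name and the numbers are hypothetical.

```python
def sequence_retention(per_step_retention, num_steps):
    """Fraction of valid sequences still reachable after num_steps steps.

    Assumes (for illustration only) that each step independently keeps the
    same fraction of its valid options.
    """
    if not 0.0 <= per_step_retention <= 1.0:
        raise ValueError("per_step_retention must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_retention ** num_steps


if __name__ == "__main__":
    # Hypothetical values: keeping 80% of valid options at each step.
    for steps in (1, 5, 10, 20):
        print(f"{steps:>2} steps: {sequence_retention(0.8, steps):.4f}")
```

Under these assumptions, keeping 80% of the valid options at each step leaves under 2% of valid sequences reachable after 20 steps.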

Key Points
  • Diversity collapse in 14 LLMs traced to order and shape miscalibration, not just sampling heuristics
  • Order calibration failure: valid tokens not reliably ranked above invalid, forcing tradeoff in rank-based cutoffs
  • Shape calibration failure: probability mass overly concentrated on few valid tokens, limiting diversity at high validity

Why It Matters

Targeting calibration directly could unlock more diverse LLM outputs for creative and scientific use cases.