New IRT-based framework evaluates LLMs orders of magnitude faster
Researchers report orders-of-magnitude speedups in IRT-based LLM evaluation, with convergence guarantees.
Evaluating large language models (LLMs) is increasingly critical, but standard benchmarking methods that rely on average accuracy fail to account for output stochasticity and item heterogeneity. Item Response Theory (IRT) offers a principled alternative for modeling latent model abilities and item characteristics, yet conventional IRT implementations are computationally expensive and numerically unstable, which limits their use at scale. In a new paper (arXiv:2605.07046), Xinhao Qu and colleagues introduce an interpretable and scalable framework that overcomes these limitations by applying the majorization-minimization (MM) principle. Their approach recasts the evaluation problem as a sequence of constrained matrix factorization subproblems, enabling stable parameter estimation with strong theoretical guarantees for identifiability and convergence.
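To make the MM idea concrete, here is a minimal sketch of how a logistic IRT fit can be turned into a sequence of least-squares factorization steps. It is not the paper's algorithm. It assumes the standard two-parameter logistic (2PL) model, in which the logit for model i on item j is a_j * theta_i + d_j with d_j = -a_j * b_j, and it uses the textbook quadratic upper bound on the logistic loss (curvature at most 1/4). The function names fit_2pl_mm, sigmoid and nll, and the alternating least-squares solver, are illustrative choices rather than anything taken from the source.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))

def nll(Y, M):
    # Bernoulli negative log-likelihood for 0/1 responses Y and logits M.
    return float(np.sum(np.logaddexp(0.0, M) - Y * M))

def fit_2pl_mm(Y, n_iter=200):
    """Illustrative MM fit of a 2PL model (assumed, not the paper's method).

    Y: (n_models, n_items) array of 0/1 responses.
    Assumes models differ in raw accuracy so the initial abilities are not all equal.
    Returns abilities theta, discriminations a, difficulties b, and the NLL trace.
    """
    n_models, _ = Y.shape
    acc = Y.mean(axis=1)
    theta = (acc - acc.mean()) / (acc.std() + 1e-8)  # start from raw accuracy
    a = np.ones(Y.shape[1])
    d = np.zeros(Y.shape[1])
    history = []
    for _ in range(n_iter):
        M = np.outer(theta, a) + d
        history.append(nll(Y, M))
        # Majorize: the loss is bounded above by (1/8) * ||M_new - Z||^2 + const,
        # where Z is a gradient step with step size 4 (1 / max curvature).
        Z = M - 4.0 * (sigmoid(M) - Y)
        # Minimize the surrogate block by block under the constraint
        # M_new = [theta, 1] @ [a, d]^T (a rank-2 factorization with a fixed column).
        theta = (Z - d) @ a / (a @ a)
        X = np.column_stack([theta, np.ones(n_models)])
        coef, *_ = np.linalg.lstsq(X, Z, rcond=None)
        a, d = coef[0], coef[1]
    history.append(nll(Y, np.outer(theta, a) + d))
    # Fix the scale/shift/sign ambiguity without changing the fitted logits.
    mu, s = theta.mean(), theta.std()
    theta, d, a = (theta - mu) / s, d + a * mu, a * s
    if a.mean() < 0:
        theta, a = -theta, -a
    b = -d / a  # difficulty; unstable for items with near-zero discrimination
    return theta, a, b, history
```

Because each block update can only lower the quadratic surrogate, and the surrogate touches the true loss at the current iterate, the negative log-likelihood in history never increases. That monotone descent is the general property that makes MM schemes numerically stable; the paper's own subproblems, constraints and guarantees may differ from this sketch.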
On synthetic data and real-world benchmarks, including MATH-500 and six datasets from the Open LLM Leaderboard, the method delivers orders-of-magnitude speedups over competing IRT implementations while matching or exceeding their estimation accuracy. The framework also yields interpretable item difficulty and discrimination parameters that align with established scaling laws and offer actionable insights for benchmark design. By lowering the computational cost, the work could make rigorous, scalable LLM evaluation more accessible to researchers and practitioners, supporting more reliable model comparisons and better-informed benchmark construction.
- Reformulates LLM evaluation as constrained matrix factorization using the majorization-minimization principle for stable, efficient estimation.
- Delivers orders-of-magnitude speedups over traditional IRT methods on MATH-500 and six Open LLM Leaderboard benchmarks.
- Provides interpretable item difficulty and discrimination parameters with theoretical guarantees for identifiability and convergence.
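As a rough illustration of what "difficulty" and "discrimination" mean, the sketch below evaluates the standard two-parameter logistic (2PL) item response curve. The source does not say which IRT variant the paper uses, so the 2PL form, the function name p_correct, and all parameter values here are assumptions chosen for illustration.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response curve (assumed form): probability that a model with
    ability theta answers an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical items: same difficulty, different discrimination.
weak_model, strong_model = -1.0, 1.0
for a in (0.5, 2.0):
    gap = p_correct(strong_model, a, b=0.0) - p_correct(weak_model, a, b=0.0)
    print(f"discrimination={a}: success gap between strong and weak model = {gap:.3f}")
```

Under this form, difficulty b is the ability level at which a model has a 50% chance of success, and discrimination a controls how sharply the item separates models just below b from models just above it. Items with low discrimination contribute little to ranking models, which is the kind of signal a benchmark designer could use when pruning or reweighting items.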
Why It Matters
Faster, interpretable LLM evaluation enables more rigorous benchmark design and better model comparison at scale.