Research & Papers

LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]

94% of weak-to-strong model pairs are reachable through benchmark win chains.

Deep Dive

Reddit user Spico197 introduced LLM Win, a website that converts published LLM benchmark results into a directed graph: whenever model A outperforms model B on any benchmark, an edge A → B is added. The tool then searches for the shortest transitive chain between any two models, surfacing absurd but statistically robust connections, such as LLaMA 2 7B beating Claude Opus 4.7 via a series of intermediate benchmark wins. An analysis of 126,937 weak-to-strong pairs (where the source model has a lower Intelligence Index than the target) found that 94.2% are reachable through such chains, and that 91.4% of the paths need only 2-3 hops, indicating the structure is not an artifact of cherry-picking long chains.
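
The construction is easy to reproduce. The sketch below is a minimal Python illustration, not the site's actual code (which the post does not show): it builds the win graph from a small, entirely made-up table of benchmark scores, then runs a breadth-first search for the shortest win chain between two models.

from collections import defaultdict, deque

# Toy benchmark table: benchmark -> {model: score}.
# All names and numbers below are invented for illustration only.
results = {
    "bench_math": {"model_a": 61.0, "model_b": 55.0, "model_c": 70.0},
    "bench_code": {"model_b": 48.0, "model_c": 44.0},
    "bench_qa":   {"model_c": 82.0, "model_d": 79.0},
}

# Directed win graph: edge u -> v if u outscores v on at least one benchmark.
wins = defaultdict(set)
for bench, scores in results.items():
    for u in scores:
        for v in scores:
            if u != v and scores[u] > scores[v]:
                wins[u].add((v, bench))

def shortest_chain(src, dst):
    """Breadth-first search for the shortest transitive win chain src -> ... -> dst."""
    queue = deque([(src, [src])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for nxt, bench in wins[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"--{bench}-->", nxt]))
    return None  # no chain of wins connects src to dst

print(shortest_chain("model_a", "model_d"))
# ['model_a', '--bench_math-->', 'model_b', '--bench_code-->', 'model_c', '--bench_qa-->', 'model_d']

In this toy data the chain takes three hops, in line with the post's observation that most weak-to-strong paths are short.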

The analysis also identified thousands of direct reversal triples: cases where a weaker model scores higher than a stronger one on a specific benchmark. Benchmarks with high reversal rates include Humanity's Last Exam, IFBench, AIME 2025, TAU2, and SciCode. IFBench, for example, shows a 17.5% reversal rate with 80% coverage and an r≈0.82 correlation with the Intelligence Index, suggesting it measures a skill partly independent of overall capability rather than simply replicating the aggregate ranking. Spico197 concludes that LLM rankings are better represented as benchmark-specific capability graphs than as a single ladder, and asks whether this reversal structure is a useful evaluation signal or merely noise from imperfect benchmarks.
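
For reference, a per-benchmark reversal rate and coverage could be computed roughly as in the Python sketch below. The post does not give formal definitions, so the ones used here are assumptions: coverage is the share of model pairs the benchmark scores at all, and the reversal rate is the share of scored pairs where the lower-Intelligence-Index model gets the higher score. All values are invented.

from itertools import combinations

# Hypothetical Intelligence Index values and per-benchmark scores,
# invented for illustration; the real analysis uses published leaderboards.
intel_index = {"model_a": 30, "model_b": 45, "model_c": 60, "model_d": 75}
bench_scores = {
    "ifbench_like": {"model_a": 52.0, "model_b": 47.0, "model_c": 61.0},
    "math_like":    {"model_b": 40.0, "model_c": 58.0, "model_d": 71.0},
}

def reversal_stats(scores, index):
    """Reversal rate and coverage for one benchmark (assumed definitions).

    A scored pair reverses when the model with the lower Intelligence Index
    gets the higher benchmark score; coverage is the fraction of all model
    pairs that the benchmark scores.
    """
    all_pairs = list(combinations(index, 2))
    scored = [(a, b) for a, b in all_pairs if a in scores and b in scores]
    reversals = [
        (a, b) for a, b in scored
        if (index[a] - index[b]) * (scores[a] - scores[b]) < 0
    ]
    rate = len(reversals) / len(scored) if scored else 0.0
    coverage = len(scored) / len(all_pairs)
    return rate, coverage, reversals

for bench, scores in bench_scores.items():
    rate, cov, pairs = reversal_stats(scores, intel_index)
    print(f"{bench}: reversal rate {rate:.0%}, coverage {cov:.0%}, reversed pairs {pairs}")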

Key Points
  • 94.2% of weak-to-strong model pairs are reachable via benchmark transitive chains, with 91.4% requiring only 2-3 hops.
  • High-reversal benchmarks include Humanity's Last Exam, IFBench, AIME 2025, TAU2, and SciCode, with IFBench showing a ~17.5% reversal rate.
  • The findings suggest LLM evaluation should use capability graphs rather than a single ladder, potentially improving specialist identification and benchmark design.

Why It Matters

For AI evaluators, these results challenge reliance on any single leaderboard ranking and point toward multidimensional, benchmark-specific capability assessment.