AI Safety

A Fast and Loose Clustering of LLM Benchmarks

New analysis shows chess-puzzle performance clusters with general-intelligence benchmarks, while coding and math skills diverge across top models.

Deep Dive

A new analysis by Epoch AI integrates 37 distinct AI benchmarks into a single 'Capabilities Index,' providing a clearer picture of which models lead the pack. The current top performers are OpenAI's GPT-5.4, Google's Gemini 3.1, and Anthropic's Claude Opus 4.6, though performance varies widely by domain; for example, OpenAI reportedly outperforms Google by 74% on the 'NumberInNameBench.' The hardest benchmarks today measure wildly different skills, from complex desktop tasks (OS Universe) and software optimization (GSO Bench) to building factories in the game Factorio and solving PhD-level math problems (FrontierMath Tier 4).
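Epoch's exact aggregation method isn't described here, so the Python sketch below shows one simple, assumed approach to condensing many benchmarks into one index: z-score each benchmark across models, then average. Every model, benchmark, and number in it is illustrative, not Epoch's data.

import numpy as np

# Hypothetical score matrix: rows = models, columns = benchmarks
# (fraction of problems solved). None of this is Epoch's real data.
scores = np.array([
    [0.92, 0.61, 0.34],  # hypothetical "model_a"
    [0.88, 0.70, 0.28],  # hypothetical "model_b"
    [0.81, 0.55, 0.41],  # hypothetical "model_c"
])

# Standardize each benchmark column so hard and easy benchmarks
# contribute on the same scale.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# One assumed index: a model's mean standardized score across benchmarks.
capabilities_index = z.mean(axis=1)
print(capabilities_index)  # one aggregate number per model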

Crucially, the analysis goes beyond a single score by statistically clustering benchmarks based on how similarly models perform on them. This reveals which capabilities are correlated (meaning they tend to improve together) versus which are independent. For instance, the data suggests surprising connections, such as chess-puzzle performance grouping with measures of general intelligence rather than strictly with math or coding. This approach challenges simple, name-based categorization (e.g., assuming all 'SWE' benchmarks are alike) and provides a more nuanced map of the AI capability landscape as models evolve from chatbots to autonomous agents.
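To make the clustering step concrete: the article doesn't specify Epoch's algorithm, so the sketch below assumes a common recipe, correlation distance plus average-linkage hierarchical clustering, applied to the benchmark columns of a models-by-benchmarks score matrix. The benchmark names and random scores are hypothetical.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical benchmarks and scores: rows = models, columns = benchmarks.
benchmarks = ["chess_puzzles", "general_intelligence", "math", "coding"]
rng = np.random.default_rng(0)
scores = rng.random((10, len(benchmarks)))

# Correlate benchmark columns: a high correlation means models that do
# well on one benchmark tend to do well on the other.
corr = np.corrcoef(scores, rowvar=False)

# Turn correlation into a distance (0 = move together, 2 = opposed).
dist = 1.0 - corr
np.fill_diagonal(dist, 0.0)

# Average-linkage hierarchical clustering over the condensed distances,
# cut into at most two clusters.
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
for name, cluster in zip(benchmarks, labels):
    print(f"{name}: cluster {cluster}")

Other distance measures or clustering methods would slot in the same way; the key input is simply the models-by-benchmarks score matrix.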

This shift reflects the industry's move beyond evaluating reasoning and knowledge alone. Companies like OpenAI now report separate scores for coding, tool use, knowledge work, and computer use in their model announcements. The clustering analysis helps explain why a model can excel at long-horizon tasks like playing Factorio without showing similar gains in advanced mathematics, highlighting the increasing specialization and divergent development paths of modern AI systems.

Key Points
  • Epoch's Capabilities Index condenses 37 benchmarks into one score, ranking GPT-5.4, Gemini 3.1, and Claude Opus 4.6 as leaders.
  • Statistical clustering reveals non-obvious benchmark relationships; e.g., chess-puzzle performance correlates with general-intelligence measures.
  • The hardest benchmarks test disparate skills: OS Universe (desktop tasks), GSO Bench (software optimization), Factorio (game agency), and FrontierMath (advanced math).

Why It Matters

For professionals, the analysis clarifies which models excel at which tasks (coding vs. agentic work), cutting through marketing hype and enabling better tool selection.