Research & Papers

MultiwayPAM: Multiway Partitioning Around Medoids for LLM-as-a-Judge Score Analysis

New tensor clustering technique reveals systematic bias in AI judges like GPT-4 and Claude while reducing computational cost.

Deep Dive

Researchers Chihiro Watanabe and Jingyu Sun have introduced MultiwayPAM, a novel tensor clustering method designed to tackle two major challenges in the increasingly popular LLM-as-a-Judge evaluation framework. This framework uses large language models like GPT-4 or Claude to score text quality across different dimensions by varying prompt templates. The new method addresses both the high computational cost of running numerous LLM inferences and the inherent biases that different AI evaluators introduce into scoring systems.

MultiwayPAM works by analyzing a three-dimensional score tensor containing evaluations across combinations of questions, answerers, and evaluators. Unlike conventional clustering methods that operate on a single mode at a time, it simultaneously estimates cluster membership and identifies medoids (representative examples) for each mode of the data tensor. This allows researchers to observe which questions, responses, or evaluators group together based on scoring patterns, revealing systematic biases in how different LLMs evaluate similar content.
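To make the idea concrete, here is a minimal sketch of a PAM-style alternating scheme over a 3D score tensor. This is an illustrative simplification under our own assumptions (Euclidean slice distances, alternating per-mode updates), not the authors' exact algorithm; the function name `multiway_medoids` and its parameters are hypothetical.

```python
import numpy as np

def multiway_medoids(X, k, iters=10, seed=0):
    """Illustrative multiway PAM-style clustering of a 3D score tensor.

    X has shape (questions, answerers, evaluators); k gives the number
    of clusters per mode. Each mode keeps k[m] medoid indices and a
    cluster label per slice. Simplified sketch, not the paper's method.
    """
    rng = np.random.default_rng(seed)
    n = X.shape
    # Initialize medoids per mode with random distinct slice indices.
    medoids = [rng.choice(n[m], size=k[m], replace=False) for m in range(3)]
    labels = [np.zeros(n[m], dtype=int) for m in range(3)]

    def slice_dist(m, i, j):
        # Euclidean distance between slices i and j along mode m.
        return np.linalg.norm(np.take(X, i, axis=m) - np.take(X, j, axis=m))

    for _ in range(iters):
        for m in range(3):
            # Assignment step: attach each slice to its nearest medoid.
            for i in range(n[m]):
                d = [slice_dist(m, i, med) for med in medoids[m]]
                labels[m][i] = int(np.argmin(d))
            # Update step: the new medoid of each cluster is the member
            # slice minimizing total within-cluster distance.
            for c in range(k[m]):
                members = np.where(labels[m] == c)[0]
                if len(members) == 0:
                    continue
                costs = [sum(slice_dist(m, i, j) for j in members)
                         for i in members]
                medoids[m][c] = members[int(np.argmin(costs))]
    return medoids, labels

# Example: cluster a synthetic 8-question x 6-answerer x 4-evaluator tensor.
scores = np.random.default_rng(1).normal(size=(8, 6, 4))
medoids, labels = multiway_medoids(scores, k=[2, 2, 2], iters=5)
```

Because medoids are actual slices of the tensor, each evaluator cluster is summarized by a real evaluator's scoring profile, which is what makes the bias patterns directly interpretable.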

The researchers validated the method on two practical datasets, demonstrating that MultiwayPAM can effectively uncover the underlying structure of evaluator bias. By identifying these patterns, teams can better understand why different AI judges assign divergent scores to the same responses, leading to more reliable evaluation frameworks. The approach also reduces the need for expensive, repeated LLM inference runs, since the recovered bias structure can be reused rather than re-estimated from scratch.

Key Points
  • MultiwayPAM uses tensor clustering to analyze 3D score data (questions × answerers × evaluators) from LLM-as-a-Judge systems
  • The method simultaneously identifies clusters and representative medoids for each dimension, revealing systematic bias patterns
  • Validated on two practical datasets, reducing computational costs by identifying underlying bias structures without repeated LLM inference

Why It Matters

Enables more reliable AI evaluation by systematically identifying and accounting for LLM judge biases, reducing computational overhead.