From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation
A new framework replaces absolute scoring with pairwise comparisons, achieving robust generalization across five previously unseen datasets.
A team of researchers has published a paper proposing a fundamental shift in how large language models (LLMs) are used to evaluate scientific papers. The current standard approach involves training models to assign an absolute score to each paper independently. The authors argue this method is flawed because scoring scales vary dramatically across different conferences, time periods, and criteria, causing models to learn narrow, context-specific rules rather than developing robust scholarly judgment.
To solve this, the team introduces CNPE (Comparison-Native framework for Paper Evaluation), which moves from 'isolated scoring' to 'collaborative ranking' by integrating comparison into both data construction and model training. First, a graph-based similarity ranking algorithm samples more informative and discriminative paper pairs from a collection. The LLM is then trained with supervised fine-tuning followed by reinforcement learning driven by comparison-based rewards. At inference, the model performs pairwise comparisons over the sampled pairs and aggregates the resulting preference signals into a global ranking.
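The summary does not spell out how those preference signals are combined, so the snippet below is only a minimal sketch of one common way to turn pairwise outcomes into a global ranking, a simple Bradley-Terry fit; the function name `aggregate_ranking` and the input format are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict

def aggregate_ranking(comparisons):
    """Turn pairwise preference signals into a global ranking (sketch).

    `comparisons` is a list of (winner_id, loser_id) tuples, e.g. outcomes
    of an LLM judging sampled paper pairs. A simple Bradley-Terry model is
    fitted by iterative minorization-maximization, and paper IDs are
    returned sorted from strongest to weakest.
    """
    papers = {p for pair in comparisons for p in pair}
    wins = defaultdict(int)                          # total wins per paper
    faced = defaultdict(lambda: defaultdict(int))    # comparison counts per pair

    for winner, loser in comparisons:
        wins[winner] += 1
        faced[winner][loser] += 1
        faced[loser][winner] += 1

    # Bradley-Terry strengths, initialised uniformly.
    strength = {p: 1.0 for p in papers}
    for _ in range(100):  # fixed number of MM updates
        new_strength = {}
        for i in papers:
            denom = sum(
                n_ij / (strength[i] + strength[j])
                for j, n_ij in faced[i].items()
            )
            new_strength[i] = wins[i] / denom if denom > 0 else strength[i]
        # Normalise so strengths stay comparable across iterations.
        total = sum(new_strength.values())
        strength = {p: s / total for p, s in new_strength.items()}

    return sorted(papers, key=lambda p: strength[p], reverse=True)


# Example: paper "A" beats "B" twice, "B" beats "C", "A" beats "C".
print(aggregate_ranking([("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]))
# -> ['A', 'B', 'C']
```

Any consistent aggregation method (win rates, Elo-style updates, or a Bradley-Terry fit as above) would serve the same role of converting local comparisons into a global ordering.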
The reported results are substantial. In experiments, the CNPE framework achieved an average relative improvement of 21.8% over the strong baseline model DeepReview-14B. Crucially, it also generalized robustly, performing well on five previously unseen datasets. This suggests the comparison-native approach helps models learn a more transferable understanding of paper quality rather than fitting a specific dataset's scoring quirks. The code for the project is publicly available, inviting further development in the field of AI-assisted academic evaluation.
- Proposes a shift from absolute scoring to relative ranking for LLM-based paper evaluation, addressing the variability of score scales across contexts.
- The CNPE framework uses a graph-based algorithm to sample paper pairs and trains the model with SFT and RL using comparison-based rewards (see the reward sketch after this list).
- Achieved a 21.8% average relative improvement over DeepReview-14B and showed strong generalization to five unseen datasets.
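As a rough illustration of what a comparison-based reward can look like, the sketch below rewards the model when its predicted preference for a paper pair agrees with a reference ordering (for example, review scores). This is an assumed, simplified form, not the paper's exact reward; `comparison_reward` and its arguments are hypothetical names.

```python
def comparison_reward(model_choice: str, score_a: float, score_b: float) -> float:
    """Hypothetical comparison-based reward for RL fine-tuning.

    `model_choice` is the model's predicted winner ("A" or "B") for a paper
    pair; `score_a` and `score_b` are reference quality signals such as
    review scores. Reward is 1.0 when the predicted preference matches the
    reference ordering, 0.5 for a tie, and 0.0 otherwise.
    """
    if score_a == score_b:
        return 0.5  # a tie carries no preference signal either way
    reference_winner = "A" if score_a > score_b else "B"
    return 1.0 if model_choice == reference_winner else 0.0


# Example: the model prefers paper A, whose reference score is higher.
print(comparison_reward("A", score_a=7.0, score_b=5.0))  # -> 1.0
```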
Why It Matters
By grounding evaluation in relative comparisons rather than absolute scores, this approach could make AI-augmented peer review and paper ranking more reliable, more scalable, and less biased by conference-specific scoring traditions.