Qwen team finds serious data quality problems in the GPQA and HLE test sets
Research confirms popular AI benchmarks contain wrong answers, undermining model evaluations.
New research from the Qwen team has exposed significant data quality problems in two widely used AI benchmark datasets, potentially undermining recent claims about model capabilities. Their paper, published on arXiv, systematically documents fundamental flaws in both GPQA (Graduate-Level Google-Proof Q&A) and Humanity's Last Exam (HLE), test sets that have been used to evaluate cutting-edge AI models such as GPT-4, Claude 3, and Llama 3.
The investigation began when independent researchers working on the 'DeepSeek-Overclock' project noticed their model producing mathematically correct answers that contradicted the benchmark's 'gold standard' labels. After writing Python scripts to verify answers line by line from first principles, they discovered that the benchmark data itself contained incorrect solutions. The Qwen team's subsequent paper confirms these findings, stating bluntly that 'a lot of the questions in the HLE test set are fundamentally broken' and that some 'standard answers are straight-up wrong.'
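The kind of first-principles check the researchers describe can be sketched in a few lines. The script below is illustrative only: the item schema, the field names (`favorable`, `total`, `gold_answer`), and the sample items are assumptions made for this example, not the actual GPQA or HLE format. It recomputes each answer from the problem's own parameters and flags items whose stored gold label disagrees with the independently derived result.

```python
# Hypothetical sketch of first-principles benchmark verification.
# The item schema and field names are illustrative assumptions,
# not the real GPQA/HLE data format.
from fractions import Fraction


def independent_solution(item: dict) -> Fraction:
    """Recompute the answer from the problem's own parameters
    instead of trusting the stored label. Toy example: a
    probability question with a closed-form answer."""
    return Fraction(item["favorable"], item["total"])


def gold_label_agrees(item: dict) -> bool:
    """Return True if the benchmark's gold label matches the
    independently derived answer."""
    return independent_solution(item) == Fraction(item["gold_answer"])


# Toy items; the second carries a deliberately wrong gold label.
items = [
    {"favorable": 3, "total": 8, "gold_answer": "3/8"},
    {"favorable": 3, "total": 8, "gold_answer": "1/2"},  # mislabeled
]

for i, item in enumerate(items):
    status = "ok" if gold_label_agrees(item) else "GOLD LABEL DISAGREES"
    print(f"item {i}: {status}")
```

Applied to a real benchmark, this same pattern (derive the answer independently, then compare it to the stored label) is what lets a researcher distinguish a model error from a broken gold answer.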
This revelation has significant implications for AI evaluation. Benchmarks like GPQA and HLE have been used to demonstrate model progress toward human-level reasoning, with companies frequently citing performance on these tests in announcements. The flawed data means some reported performance gains may be misleading, as models could be penalized for correct reasoning or rewarded for matching incorrect answers. The research community now faces the challenge of developing more reliable evaluation methods that truly measure reasoning capabilities rather than benchmark-specific pattern matching.
- Qwen research paper confirms GPQA and HLE benchmarks contain incorrect 'gold standard' answers
- Independent verification found models producing mathematically correct solutions that contradicted flawed benchmark labels
- Flaws undermine recent AI performance claims and highlight need for more rigorous evaluation methods
Why It Matters
Flawed benchmarks distort AI progress measurements, forcing reevaluation of recent model capability claims and development of better testing standards.