OpenAI's GPT-5 scores 5.5 vs GPT-4's 4.8 on SuperGLUE, but Reddit cries foul
New benchmark numbers spark debate over AI evaluation reliability.
Deep Dive
One Reddit user summed up the skepticism in a post titled "What is your first impression? I don't believe in benchmarks anymore."
Key Points
- GPT-5 scores 5.5 vs GPT-4's 4.8 on SuperGLUE, a relative improvement of roughly 15%.
- The Reddit community is skeptical, arguing that models are overfitted to benchmarks and that the resulting scores are unreliable.
- On subtasks such as Winograd Schema, the human baseline (86%) remains higher than GPT-5's score (72%).
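The arithmetic behind these figures can be checked with a short sketch. It takes the scores exactly as reported above and assumes that the improvement is measured relative to GPT-4's score; the function and variable names are illustrative only.

```python
def relative_improvement(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        raise ValueError("old score must be non-zero")
    return (new - old) / old * 100.0


# Scores as reported in the article (scale taken at face value).
gpt4_score = 4.8
gpt5_score = 5.5
gain = relative_improvement(gpt4_score, gpt5_score)

# Winograd Schema figures as reported: human baseline vs GPT-5.
human_wsc = 86.0
gpt5_wsc = 72.0
gap_points = human_wsc - gpt5_wsc

print(f"Relative improvement: {gain:.1f}%")
print(f"Gap to human baseline: {gap_points:.0f} percentage points")
```

Under that reading the gain is about 14.6%, which rounds to the quoted 15%, and GPT-5 trails the human baseline by 14 percentage points on that subtask.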
Why It Matters
Given this skepticism, professionals should evaluate models on their own specific tasks rather than rely on leaderboard scores alone.
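What a task-specific check might look like is sketched below, assuming the model can be wrapped as a function that takes a prompt and returns text. Everything here is hypothetical: `exact_match_accuracy`, `toy_model`, and the three example prompts are placeholders, not part of any real API or dataset. In practice `toy_model` would be replaced by a call to the model under test and the examples by cases drawn from your own workload.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model_fn: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of examples where the model output matches the expected answer.

    Matching ignores case and surrounding whitespace.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        1
        for prompt, expected in examples
        if model_fn(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call: a naive keyword rule.
    return "yes" if "refund" in prompt.lower() else "no"


# Hypothetical task-specific cases: (prompt, expected answer).
examples = [
    ("Can I get a refund for order 12?", "yes"),
    ("What are your opening hours?", "no"),
    ("Is a refund possible after 30 days?", "no"),
]

accuracy = exact_match_accuracy(toy_model, examples)
print(f"Task accuracy: {accuracy:.2f}")
```

Exact match is only one possible metric. Tasks with free-form answers usually need a different scoring rule, but the structure stays the same: a fixed set of your own cases, scored the same way for every model you compare.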