Models & Releases

OpenAI's GPT-5 scores 5.5 vs GPT-4's 4.8 on SuperGLUE, but Reddit cries foul

New benchmark numbers spark debate over AI evaluation reliability.

Deep Dive

A Reddit post about the results carried a skeptical title: "What is your first impression? I don't believe in benchmarks anymore."

Key Points
  • GPT-5 scores 5.5 vs GPT-4's 4.8 on SuperGLUE, a relative improvement of roughly 15%.
  • The Reddit community is skeptical, claiming that models are overfitted to benchmarks and that the resulting scores are unreliable.
  • On subtasks such as Winograd Schema, the human baseline (86%) remains above GPT-5's 72%.
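
The "15%" figure is a relative change, not a 15-point gain. A minimal sketch of the arithmetic, using only the two scores quoted above (the function name `relative_improvement` is ours, chosen for illustration):

```python
def relative_improvement(new: float, old: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Scores as reported in the post; not independently verified here.
gpt5_score = 5.5
gpt4_score = 4.8

absolute_gain = gpt5_score - gpt4_score
relative_gain = relative_improvement(gpt5_score, gpt4_score)

print(f"absolute gain: {absolute_gain:.1f} points")  # 0.7 points
print(f"relative gain: {relative_gain:.1f}%")        # 14.6%, rounded to 15% in the headline
```

The absolute gap is 0.7 points; dividing by the GPT-4 score of 4.8 gives about 14.6%, which the summary rounds to 15%.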

Why It Matters

The skepticism is a reminder that professionals should evaluate models on their own specific tasks rather than rely on leaderboard scores alone.
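
One way to act on this is to score a model against a small set of examples drawn from your own workload. The sketch below is an assumption-laden illustration, not a standard harness: `exact_match_accuracy`, `toy_model`, and the three examples are all hypothetical, and `toy_model` is a stand-in for whatever function actually calls your model.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(
    model_fn: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of prompts whose output matches the expected answer.

    Matching ignores case and surrounding whitespace.
    """
    pairs = list(examples)
    if not pairs:
        raise ValueError("need at least one example")
    correct = sum(
        model_fn(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in pairs
    )
    return correct / len(pairs)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned answers."""
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")


# Hypothetical task examples: (prompt, expected answer).
examples = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "paris"),
    ("Opposite of hot?", "cold"),
]

score = exact_match_accuracy(toy_model, examples)
print(f"task accuracy: {score:.2f}")
```

Exact match is the simplest possible metric and suits only short, unambiguous answers; tasks with free-form output would need a different scoring function, but the pattern of swapping in your own examples and your own model call stays the same.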