Models & Releases

r/OpenAI May 29, 2026

⚡ New benchmark numbers spark debate over AI evaluation reliability.

Deep Dive

A Reddit user shared the numbers under a title that openly doubts benchmarks: "What is your first impression? I don't believe in benchmarks anymore."

Key Points

GPT-5 scores 5.5 vs GPT-4's 4.8 on SuperGLUE, a relative improvement of roughly 15%.
Reddit community is skeptical, claiming benchmarks are overfitted and unreliable.
Human baselines on subtasks like Winograd Schema remain higher (86%) than GPT-5's 72%.
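
The percentages quoted above can be checked with a few lines of arithmetic. The sketch below assumes the "15%" is a relative change between the two quoted scores and that the Winograd Schema gap is a plain difference in percentage points; the function and variable names are illustrative, not from the post.

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative change of `new` over `old`, expressed as a percentage."""
    return (new - old) / old * 100.0


# Scores as quoted in the post (units as reported there).
gpt5_score, gpt4_score = 5.5, 4.8
# Winograd Schema figures as quoted: human baseline vs GPT-5, in percent.
human_wsc, gpt5_wsc = 86.0, 72.0

rel = relative_improvement(gpt5_score, gpt4_score)
gap = human_wsc - gpt5_wsc

print(f"relative improvement: {rel:.1f}%")
print(f"gap to human baseline: {gap:.0f} percentage points")
```

The relative change comes out slightly under 15%, so the headline figure is a rounding, and the human baseline on that subtask is still 14 percentage points ahead.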

Given this skepticism, professionals should evaluate models on their own tasks rather than rely on leaderboard scores alone.
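
One way to act on that advice is a small task-specific check run against your own labeled examples. The following is a minimal sketch, not a full evaluation harness: `accuracy`, `toy_model`, and the three sample examples are all hypothetical, and `toy_model` is a stand-in for whatever real model call you would plug in.

```python
from typing import Callable, Iterable


def accuracy(model: Callable[[str], str],
             examples: Iterable[tuple[str, str]]) -> float:
    """Share of examples where the model's answer matches the expected
    label exactly, ignoring case and surrounding whitespace."""
    items = list(examples)
    if not items:
        raise ValueError("need at least one example")
    correct = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in items
    )
    return correct / len(items)


def toy_model(prompt: str) -> str:
    """Hypothetical placeholder: replace with a call to the model under test."""
    return "positive" if "great" in prompt.lower() else "negative"


# Hypothetical examples drawn from your own task, not from a public benchmark.
my_task = [
    ("The product is great", "positive"),
    ("Shipping was slow", "negative"),
    ("Great support, bad docs", "negative"),
]

score = accuracy(toy_model, my_task)
print(f"task accuracy: {score:.2f}")
```

Exact-match scoring is the simplest possible metric and only suits tasks with short, unambiguous answers; open-ended tasks would need a different comparison.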