Media & Culture

GPT-5.4 Thinking benchmarks

The new model reportedly scores 92% on MMLU and 85% on GPQA, outperforming its predecessor.

Deep Dive

A viral leak suggests OpenAI has internally tested a new model, GPT-5.4, which has achieved record-breaking scores on major AI reasoning benchmarks. The reported results, shared on social media, indicate the model scored 92% on the MMLU (Massive Multitask Language Understanding) benchmark and 85% on the notoriously difficult GPQA (Graduate-Level Google-Proof Q&A) benchmark. These scores, if accurate, would represent a substantial performance jump over the current flagship model, GPT-4o, particularly in areas requiring deep, multi-step reasoning and expert-level knowledge. The leak has ignited speculation about the model's architecture and the timing of a public release.

While OpenAI has not officially confirmed GPT-5.4's existence or these specific results, the benchmarks cited are critical industry standards for evaluating AI reasoning. MMLU tests broad knowledge across 57 subjects, while GPQA is a "Google-proof" benchmark for graduate-level scientific reasoning, designed to be extremely challenging even for human domain experts. A score of 85% on GPQA would indicate a model capable of advanced scientific and technical problem-solving. The immediate implication is a new tier of AI assistant for complex R&D, data science, and academic research. The AI community is now watching for an official announcement, which would likely detail the model's capabilities, context window, and multimodal features.
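For context on what numbers like "92% on MMLU" mean: both benchmarks are scored as plain accuracy over multiple-choice questions. Here is a minimal sketch of that computation; the `Item` class and `benchmark_accuracy` function are illustrative names, not part of any official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    choices: list[str]  # e.g. four answer options, as in MMLU and GPQA
    answer: int         # index of the correct choice


def benchmark_accuracy(items: list[Item], predict) -> float:
    """Fraction of items where the model's chosen index matches the
    labeled answer -- the headline percentage benchmarks report."""
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Toy example with a trivial "model" that always picks the first choice.
items = [
    Item("2 + 2 = ?", ["4", "5", "6", "7"], 0),
    Item("Capital of France?", ["Berlin", "Paris", "Rome", "Madrid"], 1),
]
score = benchmark_accuracy(items, lambda item: 0)
print(f"{score:.0%}")  # 50%
```

A reported 92% on MMLU would mean the model selected the labeled answer on roughly 92 of every 100 such questions, across all 57 subject areas.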

Key Points
  • Reported 92% score on MMLU benchmark, indicating superior broad knowledge and understanding.
  • Alleged 85% score on the expert-level GPQA benchmark, suggesting breakthrough scientific reasoning.
  • Performance reportedly surpasses GPT-4o, signaling a major step in AI reasoning capability.

Why It Matters

Advances in reasoning benchmarks directly translate to more reliable AI for complex analysis, research, and technical problem-solving.