New AI math benchmark finds GPT-5.4 Pro has made progress on two unsolved math problems
GPT-5.4 Pro beat existing baselines on two unsolved math problems after roughly an hour of reasoning per problem, pointing to emerging research capabilities.
Researchers from the University of Oxford have developed a novel benchmark to test AI capabilities in pure mathematics, focusing on 100 problems that remain unsolved by humans. Their study reveals that OpenAI's latest model, GPT-5.4 Pro, has made verifiable progress on two of these challenges, marking a significant step for AI in formal research domains. The benchmark itself is a rigorous test bed designed to move beyond standard solved problems and evaluate an AI's capacity for genuine mathematical discovery.
In the experiments, GPT-5.4 Pro worked on each problem in an extended reasoning session of roughly an hour. On a Kakeya-type problem concerning geometric set overlaps, the model produced an optimized triangle overlap solution that outperformed the baseline from AlphaEvolve by approximately 4.9%. For a problem related to diagonal Ramsey numbers, a central topic in combinatorics, the model applied a quintic correction that lowered the bounding constant by about 2.7%. The researchers are currently validating these proposed advancements with expert mathematicians, a process detailed in their accompanying paper and social media discussions.
This work shifts the goalposts for AI evaluation from memorization and pattern recognition on known tasks to active contribution in fields requiring deep, structured reasoning. The success of GPT-5.4 Pro suggests that large language models, when given sufficient time and context, can begin to operate as research assistants in highly specialized disciplines. It highlights a potential workflow in which AI proposes novel conjectures or optimizations that human experts then rigorously verify and build upon.
- Oxford researchers created a benchmark of 100 unsolved math problems to test AI research prowess.
- OpenAI's GPT-5.4 Pro improved a solution to a Kakeya-type problem by ~4.9% and reduced a Ramsey number constant by ~2.7%.
- The model achieved this through ~1 hour of reasoning per problem, with results now undergoing expert validation.
Why It Matters
Pending expert validation, these results suggest that AI can act as a collaborative research tool in pure mathematics and other formal sciences, aiding human discovery.