GPT-5.5 catches fatal errors in ~33% of FrontierMath benchmark problems
Epoch AI's GPT-5.5-assisted review exposes flawed benchmark problems
FrontierMath, designed as one of the hardest benchmarks for frontier AI models, turns out to contain a large share of flawed problems. An AI-assisted review by Epoch AI uncovered fatal errors in approximately one-third of the problems in Tiers 1–4. According to AI researcher Noam Brown, the initial flags were generated by OpenAI's GPT-5.5. That suggests the model's reasoning is advanced enough to spot flawed questions in a benchmark built to test the limits of AI performance. The discovery also forces a re-evaluation of previously reported scores and of the benchmark's reliability.
For the AI community, this is a pivotal moment: a model strong enough to sanity-check the benchmark used to evaluate it. The incident underscores the need for ongoing quality assurance in benchmarking and raises questions about how future tests should be validated. Researchers now await corrected scores from Epoch AI, while discussion intensifies about AI's ability to audit and improve its own testing environment. The episode may reshape how benchmarks are constructed and maintained, shifting some of the quality-control work onto the models themselves.
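To make the re-scoring concrete, here is a minimal sketch of how a score can shift once flawed problems are dropped from the denominator. This is not Epoch AI's method. The function `rescore`, the problem IDs, and the pass/fail values are hypothetical toy data chosen only to illustrate the arithmetic.

```python
from typing import Dict, Set


def rescore(results: Dict[str, bool], flawed: Set[str]) -> float:
    """Return accuracy computed only over problems not flagged as flawed."""
    valid = {pid: ok for pid, ok in results.items() if pid not in flawed}
    if not valid:
        raise ValueError("no valid problems left to score")
    return sum(valid.values()) / len(valid)


# Hypothetical toy data: six problems, two of which were flagged as flawed.
results = {"p1": True, "p2": False, "p3": False,
           "p4": True, "p5": False, "p6": False}
flawed = {"p2", "p5"}

original = sum(results.values()) / len(results)  # 2 correct out of 6
corrected = rescore(results, flawed)             # 2 correct out of 4

print(f"original:  {original:.3f}")
print(f"corrected: {corrected:.3f}")
```

In this toy case the score rises because both flagged problems had been counted as failures. If a model had instead matched a wrong answer key on a flawed problem, excluding that problem would lower its score, so corrected results can move in either direction.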
- FrontierMath is considered one of the hardest benchmarks for frontier AI models.
- GPT-5.5 generated the initial flags in a review that found fatal errors in roughly one-third of the problems in Tiers 1–4.
- Epoch AI's review suggests that the model can meaningfully audit evaluation data.
Why It Matters
Shows that AI can now help audit its own benchmarks, which calls the reliability of current evaluation methods into question.