Update on the First Proof questions: according to the organizers, Gemini 3 Deepthink and GPT-5.2 Pro got questions 9 and 10 right
Top AI models just cracked the hardest problems in a major new benchmark.
Deep Dive
In the First Proof benchmark, Gemini 3 Deepthink and GPT-5.2 Pro correctly solved questions 9 and 10, the two most difficult problems, according to the organizers. Each model had two attempts with different prompts. The other eight questions remained unsolved. The test used publicly available models, and it shows where even the most powerful current systems begin to struggle with complex, multi-step proof generation.
Why It Matters
It shows where today's top AI models reach their limit on elite-level mathematical reasoning.