Research & Papers

FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

A new private benchmark reveals frontier AI models struggle with formally verified graduate-level mathematics.

Deep Dive

A team of researchers led by Nikil Ravi has unveiled FormalProofBench, a new private benchmark designed to rigorously test whether frontier AI models can produce formally verified mathematical proofs at the graduate level. Each task in the benchmark pairs a natural-language problem with a formal statement in Lean 4, a popular theorem prover. The model must then generate a complete Lean proof that is accepted by the Lean 4 checker. The problems are drawn from advanced undergraduate and graduate mathematics, sourced from qualifying exams and standard textbooks across topics like analysis, algebra, probability, and logic.

In their evaluation, the researchers tested a range of frontier models using an agentic harness—a setup where the AI can use tools and reason step-by-step. The results reveal a significant challenge: the best-performing foundation model achieved an accuracy of just 33.5%. Performance dropped rapidly for other models, indicating that current AI systems still struggle with the deep reasoning and precision required for formal theorem proving. The study, accepted at the ICLR 2026 VerifAI-2 workshop, also includes empirical analysis of tool-use patterns, common failure modes, and the associated computational cost and latency, providing a comprehensive assessment of the current state of AI in formal mathematics.

Key Points
  • The benchmark tests AI on writing Lean 4 proofs for graduate-level math problems, with the top model scoring only 33.5% accuracy.
  • Problems are sourced from real qualifying exams and textbooks across analysis, algebra, probability, and logic.
  • The evaluation uses an agentic harness to test models and includes analysis of tool-use, failures, cost, and latency.

Why It Matters

This benchmark sets a high bar for AI reasoning, pushing models toward reliable formal verification needed for scientific discovery and safety-critical systems.