Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

An 8B-parameter AI model beats neural provers up to 84x larger at formal code verification in Lean 4.

Deep Dive

A research team from Princeton University and UC Davis has introduced Goedel-Code-Prover, an AI system that automates the formal verification of code. Its core is a hierarchical proof search framework for generating machine-checkable proofs in Lean 4, a theorem prover and programming language. Rather than attempting to prove a complex goal directly, the system first decomposes it into simpler subgoals and then proves each one. A principled 'decomposition score' guides this process: the same score serves as the training reward and as the inference-time ranking criterion, so the model is trained on the objective that later ranks its candidate decompositions.
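To make the idea of decomposition concrete, here is a small Lean 4 sketch written for this article. The function `doubleAll` and all three theorem names are hypothetical and do not come from the paper or its benchmarks. The top-level specification is split into two simpler lemmas, each proved by induction, and the final theorem only combines them. It is intended to check with a recent Lean 4 toolchain without extra libraries.

```lean
-- Toy program under verification (hypothetical, not from the benchmarks).
def doubleAll : List Nat → List Nat
  | [] => []
  | x :: xs => (2 * x) :: doubleAll xs

-- Subgoal 1: doubling every element preserves the length of the list.
theorem doubleAll_length (xs : List Nat) :
    (doubleAll xs).length = xs.length := by
  induction xs with
  | nil => simp [doubleAll]
  | cons x xs ih => simp [doubleAll, ih]

-- Subgoal 2: doubling distributes over list concatenation.
theorem doubleAll_append (xs ys : List Nat) :
    doubleAll (xs ++ ys) = doubleAll xs ++ doubleAll ys := by
  induction xs with
  | nil => simp [doubleAll]
  | cons x xs ih => simp [doubleAll, ih]

-- Top-level specification, closed by combining the two subgoals.
theorem doubleAll_spec (xs ys : List Nat) :
    (doubleAll xs).length = xs.length ∧
    doubleAll (xs ++ ys) = doubleAll xs ++ doubleAll ys :=
  ⟨doubleAll_length xs, doubleAll_append xs ys⟩
```

In a hierarchical search, proposing the two lemma statements would be the decomposition step, and filling in each `by ...` block would be the proof-completion step.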

The team trained Goedel-Code-Prover-8B, a single 8-billion-parameter model that acts as a unified policy for both decomposition and proof completion. Training involved supervised initialization followed by hybrid reinforcement learning, in which a continuous decomposition reward drives exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, the model achieved a 62.0% prove success rate. That is a 2.6x improvement over the strongest baseline, and it surpasses neural theorem provers with up to 84 times as many parameters.
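The headline figures can be turned into rough absolute numbers with a few lines of arithmetic. The sketch below assumes that the 62.0% rate is pooled over all 427 tasks and that the 2.6x figure is a ratio of success rates. Neither assumption is confirmed by the summary above, so the derived values are approximations, not reported results.

```python
# Figures quoted in the article.
TOTAL_TASKS = 427        # tasks across the three benchmarks
SUCCESS_RATE = 0.620     # prove success rate of the 8B model
IMPROVEMENT = 2.6        # improvement factor over the strongest baseline
MODEL_PARAMS_B = 8       # model size, in billions of parameters
SIZE_RATIO = 84          # "up to 84x larger" provers

# Derived values (assumptions: pooled rate, ratio of success rates).
solved_tasks = round(TOTAL_TASKS * SUCCESS_RATE)    # 264.74 rounds to 265
baseline_rate = SUCCESS_RATE / IMPROVEMENT          # about 0.238
largest_prover_b = MODEL_PARAMS_B * SIZE_RATIO      # 672 (billions)

print(f"approx. tasks proved:     {solved_tasks}")
print(f"implied baseline rate:    {baseline_rate:.1%}")
print(f"implied largest prover:   {largest_prover_b}B parameters")
```

Under these assumptions the model proves roughly 265 of the 427 tasks, the strongest baseline sits near 24%, and the largest prover compared against has on the order of 670B parameters.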

Furthermore, the system demonstrates consistent inference-time scaling: its success rate improves reliably with more search iterations and a larger computational budget. The trained 8B model also uses that budget more efficiently than off-the-shelf frontier models of comparable scale. The work addresses a critical gap in software engineering: while LLMs can generate plausible code, they offer weak guarantees of correctness. Goedel-Code-Prover is a step toward automated, high-assurance verification, in which a piece of code is mathematically proven to satisfy its specification.
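The link between search budget and success rate can be illustrated with a toy, fully deterministic search loop. Everything in the sketch below is a stand-in invented for illustration: goals are plain integers, `try_prove` and `propose` replace the model's proof-completion and decomposition calls, and `score` is a made-up ranking heuristic, not the paper's decomposition score. The real system samples from a language model and checks proofs in Lean.

```python
from typing import List, Sequence

Goal = int  # toy stand-in for a Lean goal: larger means harder


class Budget:
    """Counts how many search steps are still allowed."""

    def __init__(self, calls: int) -> None:
        self.calls = calls

    def spend(self) -> bool:
        """Consume one step; return False if none are left."""
        if self.calls <= 0:
            return False
        self.calls -= 1
        return True


def try_prove(goal: Goal) -> bool:
    """Hypothetical direct prover: only small goals succeed."""
    return goal <= 2


def propose(goal: Goal) -> List[Sequence[Goal]]:
    """Hypothetical decomposer: every split of a goal into two parts."""
    return [(a, goal - a) for a in range(1, goal // 2 + 1)]


def score(parts: Sequence[Goal]) -> float:
    """Toy ranking heuristic: prefer splits whose hardest part is small."""
    return -max(parts)


def search(goal: Goal, budget: Budget) -> bool:
    """Try a direct proof, else recurse into the best-ranked decompositions."""
    if not budget.spend():
        return False
    if try_prove(goal):
        return True
    for parts in sorted(propose(goal), key=score, reverse=True):
        if all(search(part, budget) for part in parts):
            return True
    return False


def success_rate(goals: Sequence[Goal], calls: int) -> float:
    """Fraction of goals proved when each goal gets `calls` search steps."""
    return sum(search(g, Budget(calls)) for g in goals) / len(goals)


if __name__ == "__main__":
    goals = list(range(1, 9))
    for calls in (1, 3, 5, 7):
        print(calls, success_rate(goals, calls))
    # prints:
    # 1 0.25
    # 3 0.5
    # 5 0.625
    # 7 1.0
```

In this toy setting the success rate never drops as the budget grows, because a larger budget only lets the same ranked search go deeper. It says nothing about how steep the curve is for the real model.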

Key Points
  • Achieves a 62.0% prove success rate on 427 tasks from three Lean-based verification benchmarks, a 2.6x improvement over the strongest baseline.
  • The 8B-parameter model outperforms neural theorem provers up to 84x larger (about 672B parameters, i.e. 84 x 8B).
  • Combines a continuous decomposition reward with supervised replay in hybrid reinforcement learning; the same decomposition score is used as the training reward and for ranking at inference time.

Why It Matters

This brings us closer to AI that can automatically prove software correctness, a major step for safety-critical systems in aerospace, finance, and infrastructure.