Developer Tools

FVSpec benchmark tests AI on real-world formal verification

2,772 Python property-based tests turned into 9,415 Lean 4 proofs.

Deep Dive

A team of researchers (Quinn Dougherty, Max von Hippel, Hazel Shackleton, Mike Dodds) has released FVSpec, a new benchmark that pushes AI models and agents to solve real-world formal software verification problems. The core idea: take existing Python property-based tests (PBTs) from actual open-source repositories — 11,039 in total — and automatically translate a subset (2,772 PBTs, roughly 25%) into Lean 4 specifications. Each PBT yields about 3 formalizations on average, resulting in 9,415 Lean 4 files with `sorry` placeholders (i.e., unproven theorems). The translation pipeline itself is notable: it uses a three-agent LLM system to model Python semantics in Lean, infer logical properties from imperative tests, and handle the complexities of dependently typed programming. The authors provide baselines using several automated and model-based proof generation approaches, making it a comprehensive testbed.

The challenge is significant. Translating imperative Python PBTs to Lean — a language vastly different from Python — requires deep understanding of both the code's runtime behavior and formal logic. This benchmark directly addresses an underexplored problem: as AI generates an increasing share of the world's code, verifying that code is safe and correct becomes crucial. FVSpec provides a standardized way to measure progress on AI-assisted formal verification. The entire scraper, agent pipeline, and dataset are open source, inviting the community to improve upon the baselines. By connecting real-world software engineering with formal methods, FVSpec could help bridge the gap between practical AI code generation and rigorous verification.

Key Points
  • 11,039 Python property-based tests scraped from real-world repos; 2,772 translated into 9,415 Lean 4 specifications (3 per PBT on average)
  • Uses a three-agent LLM pipeline to transpile Python PBTs into Lean 4 with `sorry` placeholders, handling Python semantics and logical inference
  • Open-source benchmark with baselines for automated and model-based proof generation, targeting AI-assisted formal verification

Why It Matters

As AI generates more code, formal verification benchmarks like FVSpec are critical for ensuring safety and correctness.