Developer Tools

VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

New 500-proof benchmark shows AI models trained on math fail at real-world software verification tasks.

Deep Dive

A research team from UT Austin has released VeriSoftBench, a new benchmark designed to rigorously test AI's ability to handle formal software verification in realistic development environments. Unlike existing benchmarks drawn from mathematical libraries like Mathlib, VeriSoftBench comprises 500 Lean 4 proof obligations sourced from actual open-source formal-methods projects, packaged to preserve crucial repository context and cross-file dependencies.
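To make the "repository context" point concrete, here is a hedged Lean 4 sketch of what such a proof obligation can look like; it is not drawn from the benchmark, and all names (`Stack`, `Stack.push_size`, the file paths in comments) are hypothetical. The target theorem is stated in one file, but discharging it requires definitions that live elsewhere in the project:

```lean
-- Hypothetical "Project/Stack.lean": definitions the prover must know.
structure Stack where
  items : List Nat

def Stack.push (s : Stack) (x : Nat) : Stack :=
  { items := x :: s.items }

def Stack.size (s : Stack) : Nat :=
  s.items.length

-- Hypothetical "Project/StackProofs.lean": a benchmark-style obligation.
-- The goal statement alone is unprovable without the definitions above;
-- unfolding them reduces the goal to a fact about List.length.
theorem Stack.push_size (s : Stack) (x : Nat) :
    (s.push x).size = s.size + 1 := by
  simp [Stack.push, Stack.size]
```

A model shown only the theorem statement has no way to know what `push` and `size` compute, which is exactly the cross-file dependence the benchmark preserves.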

The evaluation of frontier large language models (LLMs) and specialized theorem provers yielded three key findings. First, models fine-tuned for mathematical theorem proving show poor transfer to this software-centric setting, indicating a significant domain gap. Second, performance is strongly correlated with 'transitive repository dependence': proofs requiring knowledge from large, multi-hop dependency chains are far less likely to be solved. Third, while providing curated context from a proof's dependency closure improves results over exposing the entire repository, it still leaves substantial room for improvement, with current AI systems struggling with the scale and specificity of real-world codebases.
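The 'transitive repository dependence' finding can be illustrated with another hedged Lean 4 sketch (again, all names are hypothetical, and the three "files" are flattened into one snippet). The target theorem sits at the end of a two-hop chain: its proof rewrites with a lemma from one file, which in turn is only provable by unfolding a definition from yet another file:

```lean
-- "File C" (two hops from the goal): a base definition.
def double (n : Nat) : Nat := n + n

-- "File B" (one hop): a lemma that requires unfolding the base definition.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double; omega

-- "File A": the target obligation. Closing it uses the lemma twice,
-- so the prover transitively depends on both other "files".
theorem double_double_eq_four_mul (n : Nat) : double (double n) = 4 * n := by
  rw [double_eq_two_mul, double_eq_two_mul]; omega
```

As the chain grows from two hops to dozens across a real codebase, the context a prover must assemble grows with it, which is consistent with the reported correlation between dependency depth and failure rate.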

This work, published on arXiv, establishes a crucial new standard for measuring progress in AI-assisted formal methods. It shifts the target from solving isolated mathematical problems to navigating the complex, definition-heavy ecosystems of software verification. The findings suggest that future AI systems for code verification will need new architectures or training approaches specifically designed to reason across entire repositories, not just individual theorems.

Key Points
  • Benchmark contains 500 Lean 4 proof obligations from real open-source software verification projects, preserving full repository context.
  • Shows AI models trained on mathematical proofs (Mathlib) fail to transfer to software verification, with performance dropping on complex dependencies.
  • Providing curated dependency context helps but leaves a large performance gap, highlighting a core challenge for AI in practical software engineering.

Why It Matters

The benchmark reveals a major limitation in current AI for real-world software verification, steering research toward systems that can reason across entire codebases.