AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
A new study reveals how sharply AI performance drops when models must write formally verified code, with success rates varying roughly fivefold across verification languages.
Researchers introduced AlgoVeri, a benchmark that evaluates AI models on generating formally verified implementations of 77 classical algorithms in three verification languages: Dafny, Verus, and Lean. The results expose a critical capability gap: frontier models like Gemini-3 Flash achieved a 40.3% success rate in Dafny, but performance collapsed to 24.7% in Verus and just 7.8% in Lean. The benchmark highlights how language design and the amount of proof construction each language demands create major barriers for current AI systems.
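To make the task concrete: in Dafny, a model must emit executable code together with a formal specification and enough annotations (such as loop invariants) for the automated verifier to discharge the proof on its own, whereas Lean typically requires explicit proof terms, which helps explain the steeper drop. The sketch below is illustrative only and is not drawn from the benchmark; the method name and specification are hypothetical stand-ins for the kind of classical-algorithm task described.

```dafny
// Illustrative sketch (not from AlgoVeri): a verified maximum-element search.
// The ensures clauses are the specification; the invariants are the
// annotations a model must also get right for the verifier to accept the code.
method MaxElement(a: array<int>) returns (m: int)
  requires a.Length > 0
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= m   // m bounds every element
  ensures exists i :: 0 <= i < a.Length && a[i] == m    // m occurs in the array
{
  m := a[0];
  var j := 1;
  while j < a.Length
    invariant 1 <= j <= a.Length
    invariant forall i :: 0 <= i < j ==> a[i] <= m      // m bounds the prefix seen so far
    invariant exists i :: 0 <= i < j && a[i] == m       // m occurs in that prefix
  {
    if a[j] > m { m := a[j]; }
    j := j + 1;
  }
}
```

Note the all-or-nothing character of the task: drop or weaken a single invariant and the verifier rejects the method outright, which is the kind of hard failure the reported success rates reflect.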
Why It Matters
This reveals a fundamental limitation of current AI models in generating reliable, provably correct software, a capability essential for safety-critical applications.