AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
A new study reveals how sharply AI performance drops when models must write formally verified code, with success rates varying roughly fivefold across verification languages.
Researchers introduced AlgoVeri, a benchmark that evaluates AI models on generating formally verified implementations of 77 classical algorithms in three verification languages: Dafny, Verus, and Lean. The results expose a critical capability gap: frontier models like Gemini-3 Flash achieved a 40.3% success rate in Dafny, but performance collapsed to 24.7% in Verus and just 7.8% in Lean. The benchmark highlights how language design and the amount of proof construction each language demands create major barriers for current AI systems.
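To make the task concrete: in Dafny, a model must emit executable code together with a formal specification and enough annotations (such as loop invariants) for the automated verifier to discharge the proof on its own, whereas Lean typically requires explicit proof terms, which helps explain the steeper drop. The sketch below is illustrative only and is not drawn from the benchmark; the method name and specification are hypothetical stand-ins for the kind of classical-algorithm task described.

```dafny
// Illustrative sketch (not from AlgoVeri): a verified maximum-element search.
// The ensures clauses are the specification; the invariants are the
// annotations a model must also get right for the verifier to accept the code.
method MaxElement(a: array<int>) returns (m: int)
  requires a.Length > 0
  ensures forall i :: 0 <= i < a.Length ==> a[i] <= m   // m bounds every element
  ensures exists i :: 0 <= i < a.Length && a[i] == m    // m occurs in the array
{
  m := a[0];
  var j := 1;
  while j < a.Length
    invariant 1 <= j <= a.Length
    invariant forall i :: 0 <= i < j ==> a[i] <= m      // m bounds the prefix seen so far
    invariant exists i :: 0 <= i < j && a[i] == m       // m occurs in that prefix
  {
    if a[j] > m { m := a[j]; }
    j := j + 1;
  }
}
```

Note the all-or-nothing character of the task: drop or weaken a single invariant and the verifier rejects the method outright, which is the kind of hard failure the reported success rates reflect.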
Why It Matters
This reveals a fundamental limitation of current AI models in generating reliable, provably correct software, a capability essential for safety-critical applications.