VeriContest benchmark reveals AI code generation's proof bottleneck at 5.29% success

arXiv cs.SE May 12, 2026

⚡ Best model nails 92% on code generation but manages only 5% on end-to-end verified generation.

Deep Dive

Researchers from multiple institutions, including Zichen Xie and Mrigank Pawagi, introduced VeriContest, a benchmark of 946 competitive programming problems drawn from LeetCode and Codeforces and designed to measure verifiable code generation in Rust with the Verus verifier. Unlike standard benchmarks, which check functional correctness only through testing, VeriContest requires models to produce formal specifications and machine-checkable proofs alongside executable code. The benchmark was built in a three-phase pipeline: manually verified seed problems, semi-automated expansion with human review, and a final quality-assurance layer that uses test suites to validate postcondition completeness.
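To make "postcondition completeness" concrete, here is a minimal sketch in plain Rust rather than Verus syntax. It assumes a toy problem (return the maximum of a non-empty list) and models a postcondition as an ordinary boolean function. The names `weak_post`, `complete_post`, and `rejects_all_wrong` are hypothetical and are not taken from VeriContest. The idea illustrated is that a test suite can expose a specification that is too weak by feeding it wrong outputs that it fails to reject.

```rust
// Hypothetical illustration; not the benchmark's actual tooling.

/// Weak postcondition: `result` is an upper bound on every element.
/// Any sufficiently large value satisfies it, so it is incomplete.
fn weak_post(input: &[i64], result: i64) -> bool {
    input.iter().all(|&x| x <= result)
}

/// Stronger postcondition: `result` is an upper bound AND occurs in the input.
fn complete_post(input: &[i64], result: i64) -> bool {
    weak_post(input, result) && input.contains(&result)
}

/// Returns true if `post` accepts the expected output and rejects
/// every known-wrong candidate output for the same input.
fn rejects_all_wrong(
    post: fn(&[i64], i64) -> bool,
    input: &[i64],
    expected: i64,
    wrong: &[i64],
) -> bool {
    post(input, expected) && wrong.iter().all(|&w| !post(input, w))
}
```

For the input `[3, 7, 2]` with expected output `7`, the weak postcondition accepts the wrong answer `8`, so the check fails for it, while the stronger postcondition rejects `8`, `3`, and `100`.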

Evaluating ten state-of-the-art LLMs revealed a stark gap between raw coding ability and verifiable generation. The best model achieved 92.18% on natural-language-to-code generation but only 48.31% on specification generation, 13.95% on proof generation, and just 5.29% when performing all tasks end-to-end. These results identify proof and specification generation as the central bottlenecks for current models. VeriContest provides a rigorous platform for measuring and training systems that generate code with machine-checkable correctness, pushing AI beyond simple test-passing toward formal software verification.
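One way to see why the end-to-end number sits so far below the code-generation number is a simplified scoring sketch. It assumes each problem records one pass/fail flag per stage and that an end-to-end success requires all three stages to pass. The `Outcome` struct and the helper functions are hypothetical, and the benchmark's actual harness may score tasks differently.

```rust
// Hypothetical scoring model; field and function names are assumptions.

/// Per-problem result for one model, one flag per stage.
struct Outcome {
    spec_ok: bool,
    code_ok: bool,
    proof_ok: bool,
}

fn spec_passed(o: &Outcome) -> bool { o.spec_ok }
fn code_passed(o: &Outcome) -> bool { o.code_ok }
fn proof_passed(o: &Outcome) -> bool { o.proof_ok }

/// End-to-end success requires every stage to pass on the same problem.
fn end_to_end(o: &Outcome) -> bool {
    o.spec_ok && o.code_ok && o.proof_ok
}

/// Percentage of problems for which `passed` holds (0.0 for an empty set).
fn pass_rate(outcomes: &[Outcome], passed: fn(&Outcome) -> bool) -> f64 {
    if outcomes.is_empty() {
        return 0.0;
    }
    let n = outcomes.iter().filter(|&o| passed(o)).count();
    100.0 * n as f64 / outcomes.len() as f64
}
```

Under this model the end-to-end rate can never exceed the weakest single-stage rate, and it drops further whenever failures in different stages fall on different problems.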

Key Points

VeriContest includes 946 problems from LeetCode and Codeforces with expert-validated formal specs, Rust code, Verus-checked proofs, and test suites.
Best LLM scored 92.18% on code generation but only 5.29% end-to-end when including spec and proof generation.
Supports isolated and compositional evaluation of specification, code, proof, and full verified program synthesis.

Why It Matters

The results show that AI-assisted software verification is still nascent, which makes formal proof generation the next frontier for reliable coding assistants.
