POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
420 real-world tasks from 121 projects test LLM-generated postconditions for correctness and completeness.
Formal postconditions are precise specifications that characterize program behavior and are essential for debugging, testing, and verification. However, writing them manually requires significant expertise. Recent work has turned to large language models (LLMs) to automatically generate postconditions from code and natural-language artifacts. Evaluating these generated postconditions, though, has been a bottleneck: existing benchmarks focus heavily on correctness, relying on surface-form matching or small synthetic datasets.
To address this, researchers from academia introduced POSTCONDBENCH, a benchmark of 420 real-world Python and Java tasks from 121 open-source projects. Each task includes expert-constructed ground-truth postcondition sets. The benchmark provides a runnable execution environment and operationalizes completeness through defect discrimination: a postcondition set is more complete if it's violated by more defective implementations while remaining satisfied on correct runs. Testing five state-of-the-art LLMs revealed a significant gap between correctness and completeness, with repository-level dependencies and method complexity widening that gap. The work highlights the challenge of generating truly robust and useful formal specifications.
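To make the defect-discrimination idea concrete, here is a minimal sketch in Python (not the benchmark's actual harness; the `clamp` function, its buggy variants, and the predicates are hypothetical). A postcondition set is exercised against one correct and several defective implementations: a more complete set flags more of the defective ones while never flagging the correct one.

```python
# Hypothetical illustration of completeness as defect discrimination,
# using a simple clamp(x, lo, hi) method as the program under test.

def clamp_correct(x, lo, hi):
    return max(lo, min(x, hi))

def clamp_bug_no_upper(x, lo, hi):
    # Defective: never applies the upper bound.
    return max(lo, x)

def clamp_bug_no_lower(x, lo, hi):
    # Defective: never applies the lower bound.
    return min(hi, x)

# A candidate postcondition set: predicates over (inputs, result).
postconditions = [
    lambda x, lo, hi, r: lo <= r <= hi,            # result stays within bounds
    lambda x, lo, hi, r: r == x or r in (lo, hi),  # result is x or a clamped bound
]

# A small pool of test inputs (assumed given by the execution environment).
inputs = [(5, 0, 10), (-3, 0, 10), (42, 0, 10)]

def discriminates(impl):
    """True if some postcondition is violated on some input, i.e. the bug is caught."""
    return any(
        not post(x, lo, hi, impl(x, lo, hi))
        for (x, lo, hi) in inputs
        for post in postconditions
    )

# Soundness check: the correct implementation satisfies every postcondition.
assert not discriminates(clamp_correct)

# Completeness signal: how many defective implementations are caught.
caught = sum(discriminates(bug) for bug in (clamp_bug_no_upper, clamp_bug_no_lower))
print(f"defective implementations caught: {caught}/2")
```

Under this measure, a trivially weak set (e.g. only `r <= hi`) would still pass the correct implementation but catch fewer bugs, scoring lower on completeness even though it is "correct."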
- POSTCONDBENCH includes 420 tasks (Python and Java) from 121 real open-source projects with expert-validated ground-truth postconditions.
- Completeness is measured via defect discrimination: a postcondition set is more complete if it is violated by more buggy implementations while remaining satisfied by correct ones.
- Evaluation of 5 SOTA LLMs shows a substantial gap between correctness and completeness, exacerbated by repository-level dependencies and method complexity.
Why It Matters
This benchmark sets a new standard for evaluating AI-generated formal specifications, with direct implications for automated debugging, testing, and verification in software engineering.