Developer Tools

PBT-Bench: New AI benchmark for property-based testing tasks

LLMs achieve only 31–83% bug recall on 100 curated Python library problems

Deep Dive

In a paper posted to arXiv, a team led by Lucas Jing has released PBT-Bench, a new benchmark designed to isolate and evaluate AI agents' ability to perform property-based testing. Unlike existing code benchmarks that test only bug reproduction or patch generation, PBT-Bench requires agents to derive semantic invariants from library documentation and construct precise input-generation strategies that can trigger hidden bugs. The benchmark comprises 100 problems drawn from 40 popular Python libraries, with a total of 365 injected bugs (mean 3.65 per problem) stratified across three difficulty levels, ranging from single-constraint boundary bugs to complex stateful protocol violations.
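To make the task concrete, here is a minimal sketch of what "deriving an invariant from documentation and building an input strategy" can look like. It is not taken from PBT-Bench: the `truncate` function, its documented contract, and the injected bug are all hypothetical, and the sketch uses only Python's standard `random` module to stay dependency-free. In Hypothesis the same idea would typically be written with `@given` and strategies such as `st.integers` and `st.text`.

```python
import random


def truncate(text: str, max_len: int) -> str:
    """Hypothetical library function (an assumption for illustration).

    Documented contract:
    - the result never exceeds max_len characters
    - text that already fits (len(text) <= max_len) is returned unchanged
    """
    if len(text) < max_len:  # injected boundary bug: should be <=
        return text
    return text[: max_len - 3] + "..."


def gen_case(rng: random.Random):
    """Boundary-aware input strategy: text lengths cluster around max_len."""
    max_len = rng.randint(3, 30)
    length = max(0, max_len + rng.choice([-2, -1, 0, 1, 2]))
    text = "".join(rng.choice("abc") for _ in range(length))
    return text, max_len


def check_properties(text: str, max_len: int):
    """Return the name of the violated property, or None if both hold."""
    out = truncate(text, max_len)
    if len(out) > max_len:
        return "length bound"
    if len(text) <= max_len and out != text:
        return "fits-unchanged"
    return None


def find_counterexample(seed: int = 0, trials: int = 200):
    """Run seeded random trials and return the first failing input, if any."""
    rng = random.Random(seed)
    for _ in range(trials):
        text, max_len = gen_case(rng)
        violated = check_properties(text, max_len)
        if violated is not None:
            return text, max_len, violated
    return None


if __name__ == "__main__":
    print(find_counterexample())
```

The bug only fires when the text length equals `max_len` exactly, so a generator that samples lengths far from the limit would rarely catch it. Concentrating inputs near the documented boundary is the kind of strategy design the benchmark is described as measuring.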

In their evaluation of eight contemporary LLMs under two prompting regimes (an open-ended baseline and explicit Hypothesis framework scaffolding), the team found wide variation in performance. Bug recall with the structured Hypothesis prompt ranged from 42.1% to 83.4%, while the open-ended baseline yielded 31.4% to 76.7%. Scaffolding lifted mid-capability models by more than 20 percentage points, but the gains were smaller for the strongest models, and in two cases performance actually degraded. This suggests that rigid prompts can interfere with certain model behaviors rather than complement them.
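The article does not spell out how bug recall is aggregated. A minimal sketch, assuming it is the pooled share of injected bugs that a model's tests detect across all problems, with the scaffolding effect expressed in percentage points, might look like this. The function names and the toy counts are illustrative assumptions, not benchmark data.

```python
def bug_recall(results):
    """Pooled recall over (detected, injected) pairs, one pair per problem.

    Assumption: recall = total detected bugs / total injected bugs.
    """
    detected = sum(d for d, _ in results)
    injected = sum(n for _, n in results)
    if injected == 0:
        raise ValueError("no injected bugs to score against")
    return detected / injected


def delta_percentage_points(baseline: float, scaffolded: float) -> float:
    """Difference between two recall fractions, in percentage points."""
    return (scaffolded - baseline) * 100


if __name__ == "__main__":
    # Toy counts for three hypothetical problems (not real benchmark results).
    baseline_runs = [(1, 4), (2, 3), (1, 4)]
    scaffolded_runs = [(3, 4), (2, 3), (2, 4)]
    base = bug_recall(baseline_runs)
    scaf = bug_recall(scaffolded_runs)
    print(f"baseline={base:.3f} scaffolded={scaf:.3f} "
          f"delta={delta_percentage_points(base, scaf):.1f} pp")
```

A per-problem average would weight problems equally instead of bugs, so the two aggregations can diverge when problems carry different numbers of injected bugs.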

The benchmark also revealed that the hardest bugs are highly model-specific: different architectures failed on different problems, indicating that no single model currently closes all gaps. This work points to a need for better documentation-grounded semantic reasoning in AI systems. The authors have released the full benchmark, evaluation harness, and corpus to support further research into property-based testing and AI-driven software reliability.

Key Points
  • 100 curated property-based testing problems across 40 real Python libraries with 365 injected bugs
  • Best LLM achieved 83.4% bug recall with Hypothesis scaffolding; worst achieved 31.4% with open-ended prompts
  • Hypothesis scaffolding boosted mid-capability models by >20 percentage points but degraded performance in two top models

Why It Matters

Property-based testing is critical for software reliability, and this benchmark exposes how weakly current AI systems reason about program semantics from documentation.