Developer Tools

ReproFlake dataset offers 1,115 reproducible flaky tests for debugging

Finally, flaky test failures with scripts to reproduce and fix them.

Deep Dive

Flaky tests, which pass and fail non-deterministically on the same code, plague software development, and reproducing their failures has been notoriously difficult. Academic researchers (including Suzzana Rafi, August Shi, and Wing Lam) have released ReproFlake, a curated dataset of 1,115 reproducible flaky tests spanning four categories: async-wait, concurrency, order-dependent, and other. Unlike prior datasets that provided disjoint sets of tests or mere logs, ReproFlake includes full reproducible environments (e.g., Docker setups), scripts to trigger failures on demand, scripts to apply fixes and verify that they eliminate flakiness, and detailed execution logs for both passing and failing runs. The team also established contribution guidelines so the dataset can grow collaboratively.

Using ReproFlake, the researchers analyzed challenges in reproducing flaky tests, such as unresolved compilation failures when building legacy projects, and characterized typical fix locations, which can guide prioritization of repair efforts. They found that error information often helps identify the flaky test category, and that knowing where fixes commonly reside (e.g., in test code vs. production code) can streamline debugging. The dataset gives both academic studies and industrial tooling a standardized, reproducible benchmark, a step toward making flaky-test research more rigorous and actionable.

Key Points
  • 1,115 flaky tests across four categories: async-wait, concurrency, order-dependent, and other
  • Includes reproducible environments (Docker), failure reproduction scripts, fix automation scripts, and execution logs
  • Contribution guidelines allow the community to expand the dataset collaboratively
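
The reproduce-then-verify workflow described above can be sketched roughly as follows. This is an assumption-laden toy, not ReproFlake's actual scripts: `rerun`, `flaky_test`, and `fixed_test` are hypothetical, and the async-wait timing is simulated with a seeded random number generator so that the illustration is repeatable.

```python
import random

def rerun(test, runs, seed=0):
    """Run `test` repeatedly and return the number of failing runs."""
    rng = random.Random(seed)  # seeded so this illustration is repeatable
    failures = 0
    for _ in range(runs):
        try:
            test(rng)
        except AssertionError:
            failures += 1
    return failures

def flaky_test(rng):
    # Simulated async-wait bug: the test waits a fixed 50 ms, but the
    # (simulated) background task takes between 10 and 100 ms.
    task_duration_ms = rng.uniform(10, 100)
    fixed_wait_ms = 50
    assert task_duration_ms <= fixed_wait_ms, "result not ready"

def fixed_test(rng):
    # Simulated fix: wait until the task completes instead of a fixed delay.
    task_duration_ms = rng.uniform(10, 100)
    waited_ms = task_duration_ms  # stands in for blocking on completion
    assert task_duration_ms <= waited_ms, "result not ready"

before = rerun(flaky_test, runs=200)
after = rerun(fixed_test, runs=200)
print(f"failures before fix: {before}/200, after fix: {after}/200")
```

A real harness would rerun the actual test in its packaged environment; the point here is only the shape of the check: some failures before the fix, none after it.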

Why It Matters

Reproducible flaky tests let teams systematically debug nondeterministic failures, improving software reliability and CI pipeline stability.