A Practical Framework for Flaky Failure Triage in Distributed Database Continuous Integration
New AI system decides if a database test failure is real or flaky in under 2 milliseconds.
A team of researchers has introduced SCOUT, a novel framework designed to tackle the critical problem of flaky test failures in the continuous integration (CI) pipelines of distributed databases like TiDB. In these high-pressure environments, operators must instantly decide whether a test failure is a transient "flaky" issue—which might pass on a rerun—or a genuine "persistent" bug that needs immediate developer attention. Existing solutions often fail because they rely on data unavailable at decision time, produce unreliable scores, or learn from biased labels. SCOUT addresses this by using only strict-causal features (pre-failure telemetry and historical data) to make real-time, online predictions without any lookahead, ensuring decisions are both fast and practical for deployment.
The framework's core innovation is a three-part approach: lightweight state-aware scoring, post-hoc calibration to maintain accuracy across shifting workloads, and a correction technique to reduce bias from finite rerun policies. Evaluated on a benchmark of 3,680 failed runs with 462 confirmed flaky failures, SCOUT demonstrated high effectiveness. Crucially, it was deployed in a production environment, where it processed decisions with an end-to-end latency of just 1.17 milliseconds at the 95th percentile, operating on CPU-only budgets. This proves its feasibility for real-world, large-scale systems, as further validated on metadata from TiDB v7/v8 and extensive GitHub Actions traces.
- SCOUT makes triage decisions using only pre-failure data, achieving a P95 latency of 1.17 ms on CPU.
- The framework was evaluated on 3,680 failed CI runs and validated on production TiDB and GitHub Actions data.
- It uses post-hoc calibration and bias correction to handle telemetry shifts and finite rerun policy labels.
Why It Matters
This drastically reduces engineering time wasted on investigating false alarms in critical database deployment pipelines.