CI-Repair-Bench: New benchmark exposes LLM limits in fixing CI failures

arXiv cs.SE May 01, 2026

⚡ 567 real failures, 12 error types, and a best-LLM repair success rate of only 18.9%.

Deep Dive

Researchers from Concordia University have released CI-Repair-Bench, a repository-aware benchmark that validates automated patches against real Continuous Integration (CI) workflows from GitHub Actions. The benchmark addresses a gap in program repair research: existing benchmarks focus on source-code-level test failures and ignore the multi-stage CI pipeline around them. CI-Repair-Bench contains 567 CI failure instances drawn from 103 open-source repositories, with each failure assigned to one of 12 error types, including formatting violations, dependency conflicts, environment misconfigurations, and test flakiness. Repairs are validated by re-running the original CI workflow, so correctness is measured in a realistic, production-like setting.
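To make the evaluation idea concrete, here is a minimal sketch of how per-category and overall repair success rates could be tallied from workflow re-run outcomes. The `RepairOutcome` record, its field names, and the sample data are all hypothetical illustrations; none of them come from the benchmark's actual schema or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairOutcome:
    """Hypothetical record for one repair attempt (not the benchmark's schema)."""
    repo: str
    error_type: str
    workflow_passed: bool  # True only if the re-run CI workflow succeeded


def success_rates(outcomes):
    """Return (overall_rate, {error_type: rate}) for a list of outcomes."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for outcome in outcomes:
        totals[outcome.error_type] += 1
        if outcome.workflow_passed:
            passed[outcome.error_type] += 1
    per_type = {etype: passed[etype] / totals[etype] for etype in totals}
    overall = sum(passed.values()) / len(outcomes) if outcomes else 0.0
    return overall, per_type


# Made-up outcomes, purely for illustration.
sample = [
    RepairOutcome("org/a", "formatting", True),
    RepairOutcome("org/b", "formatting", True),
    RepairOutcome("org/c", "dependency", False),
    RepairOutcome("org/d", "environment", False),
]
overall, per_type = success_rates(sample)
```

Counting a repair as successful only when the whole workflow passes is what separates this style of evaluation from one that checks unit tests alone.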

The benchmark's evaluation of state-of-the-art LLM-based repair systems reveals stark limitations. Automated repair does well on simple, tool-enforced issues such as linting and formatting (often exceeding 90% success), but it struggles with environment-specific failures such as missing runtime dependencies or configuration mismatches. The best-performing LLM (likely GPT-4) achieved only an 18.9% overall repair success rate. These results show how far autonomous CI repair still has to go and give future research a rigorous foundation. The paper also includes a reference CI repair workflow that parses logs, localizes faults, and generates candidate patches, giving practitioners a starting point for building real-world CI debugging tools.
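The first two stages of such a workflow, log parsing and fault localization, can be illustrated with a small sketch. This is not the paper's reference implementation: the regular expressions, category labels, and function names below are assumptions chosen for illustration, and the patch-generation stage (which would call an LLM) is left out.

```python
import re

# Hypothetical keyword patterns; a real system would need far broader coverage.
PATTERNS = [
    ("dependency", re.compile(r"ModuleNotFoundError|No matching distribution", re.I)),
    ("formatting", re.compile(r"would reformat|line too long", re.I)),
    ("environment", re.compile(r"command not found", re.I)),
]

# Matches references such as "src/app.py:12" in linter or traceback output.
FILE_LINE = re.compile(r"([\w./-]+\.py):(\d+)")


def classify_failure(log_text):
    """Return the first matching category label, or "unknown"."""
    for label, pattern in PATTERNS:
        if pattern.search(log_text):
            return label
    return "unknown"


def localize(log_text):
    """Return unique (path, line) pairs mentioned in the log, in order."""
    hits = []
    for path, line in FILE_LINE.findall(log_text):
        hit = (path, int(line))
        if hit not in hits:
            hits.append(hit)
    return hits


lint_log = "src/app.py:12:80: E501 line too long (95 > 79 characters)\n"
import_log = "ModuleNotFoundError: No module named 'requests'\n"

lint_result = (classify_failure(lint_log), localize(lint_log))
import_result = (classify_failure(import_log), localize(import_log))
```

The two sample logs hint at why the categories differ in difficulty: the lint message names the file and line to fix, while the missing-module error points to nothing in the repository, so the fix has to be inferred from the environment or dependency configuration.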

Key Points

CI-Repair-Bench includes 567 CI failure instances from 103 GitHub repositories, categorized into 12 CI-specific error types.
Repair correctness is evaluated by full re-execution of the original GitHub Actions workflows, not just unit tests.
Best LLM achieved only 18.9% repair success rate overall; automated repair struggles most with environment and dependency failures.

Why It Matters

CI-Repair-Bench provides a realistic yardstick for building AI tools that can actually debug real-world build and deployment pipelines.
