ScarfBench: New benchmark exposes AI agents' 15% success rate in Java migration

arXiv cs.SE May 11, 2026

⚡Top AI agents pass at most 15.3% of behavioral tests on enterprise Java framework migration tasks...

Deep Dive

ScarfBench, introduced by researchers from multiple institutions, addresses a critical gap in software engineering benchmarks: measuring AI agents' ability to perform cross-framework migration of enterprise Java applications. The benchmark comprises 34 real-world enterprise applications, each with an expert-written triple of implementations in Spring, Jakarta EE, and Quarkus. This yields 102 variants (~151K lines of code across 1,946 source and test files) and 204 directed refactoring tasks, a count consistent with six ordered source-to-target framework pairs per application. Each task gives an AI agent a working source application and a target framework specification, and the agent must synthesize a target implementation that preserves all original behavior. Correctness is evaluated by an executable oracle that checks compilation, deployment in a containerized runtime, and whether the behavioral tests pass.
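The task count can be checked with simple enumeration: three frameworks give six ordered (source, target) pairs, and 34 applications times six pairs is 204. The sketch below illustrates that arithmetic only. The names `ScarfTaskSketch`, `Framework`, `MigrationTask`, and `enumerateTasks` are hypothetical and are not taken from the ScarfBench harness.

```java
import java.util.ArrayList;
import java.util.List;

public class ScarfTaskSketch {
    // The three frameworks named in the article.
    enum Framework { SPRING, JAKARTA_EE, QUARKUS }

    // Hypothetical record: one directed migration of one application.
    record MigrationTask(int appId, Framework source, Framework target) {}

    // Enumerate every ordered (source, target) pair with source != target
    // for each application. This is an assumed task layout, not the
    // benchmark's actual task manifest.
    static List<MigrationTask> enumerateTasks(int appCount) {
        List<MigrationTask> tasks = new ArrayList<>();
        for (int app = 1; app <= appCount; app++) {
            for (Framework src : Framework.values()) {
                for (Framework dst : Framework.values()) {
                    if (src != dst) {
                        tasks.add(new MigrationTask(app, src, dst));
                    }
                }
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        // 34 applications * 6 ordered pairs = 204 directed tasks
        System.out.println(enumerateTasks(34).size());
    }
}
```

Under this layout, each of the 102 variants serves as the source for exactly two tasks, one per remaining framework.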

Results from five state-of-the-art coding agents were sobering: the strongest agent achieved only a 15.3% aggregate test pass rate on focused-layer migrations and 12.2% on whole applications. Across all 204 tasks, only one submission yielded a fully behaviorally equivalent target. The difficulty is asymmetric: migrations between Spring and Quarkus are the most tractable, while Jakarta-targeted migrations are the hardest. Analysis of failure traces (using LLM-as-judge and expert adjudication) produced a taxonomy of recurring failure categories spanning the build, deploy, and test stages. The benchmark, harness, and agent traces are publicly released, providing a rigorous test for future cross-framework migration tools.
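Because the oracle checks compilation, then deployment, then behavioral tests, a failure at an earlier stage hides everything after it. The sketch below shows that gating idea under simplifying assumptions: all names (`StagedOracleSketch`, `Outcome`, `evaluate`) are hypothetical, the stage checks are injected as suppliers rather than real build or container calls, and the test stage is treated as all-or-nothing even though the article reports aggregate pass rates.

```java
import java.util.function.BooleanSupplier;

public class StagedOracleSketch {
    // Hypothetical outcome labels mirroring the build, deploy, and test stages.
    enum Outcome { BUILD_FAILURE, DEPLOY_FAILURE, TEST_FAILURE, PASS }

    // Run the stages in order and report the first one that fails.
    // Later suppliers are never invoked once an earlier stage fails.
    static Outcome evaluate(BooleanSupplier compiles,
                            BooleanSupplier deploys,
                            BooleanSupplier testsPass) {
        if (!compiles.getAsBoolean()) {
            return Outcome.BUILD_FAILURE;
        }
        if (!deploys.getAsBoolean()) {
            return Outcome.DEPLOY_FAILURE;
        }
        if (!testsPass.getAsBoolean()) {
            return Outcome.TEST_FAILURE;
        }
        return Outcome.PASS;
    }

    public static void main(String[] args) {
        // Stand-in checks: the build succeeds but deployment does not.
        Outcome outcome = evaluate(() -> true, () -> false, () -> true);
        System.out.println(outcome); // prints DEPLOY_FAILURE
    }
}
```

Classifying each run by its first failing stage is one simple way to bucket traces before any finer-grained failure analysis.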

Key Points

34 enterprise Java applications with 102 framework variants across Spring, Jakarta EE, and Quarkus (151K lines of code)
204 directed refactoring tasks; best agent achieves a 15.3% aggregate test pass rate on focused-layer migrations and 12.2% on whole applications
Only 1 out of 204 tasks produced a fully behaviorally equivalent migration, with Jakarta-targeted migrations hardest

Why It Matters

ScarfBench sets a rigorous standard for measuring AI's ability to migrate enterprise Java applications across frameworks, a massive industry need.
