Developer Tools

ScarfBench: New benchmark exposes AI agents' 15% success rate in Java migration

Top AI agents fail 85% of enterprise Java framework migration tasks...

Deep Dive

ScarfBench, introduced by researchers from multiple institutions, addresses a gap in software engineering benchmarks: measuring how well AI agents can migrate enterprise Java applications from one framework to another. The benchmark comprises 34 real-world enterprise applications, each with expert-written implementations in three frameworks: Spring, Jakarta EE, and Quarkus. This yields 102 variants (~151K lines of code across 1,946 source and test files) and 204 directed refactoring tasks, which works out to six source-to-target directions per application. In each task, an AI agent receives a working source application and a target framework specification, then must synthesize a target implementation that preserves all of the original behavior. Correctness is evaluated by an executable oracle that checks three things in turn: that the code compiles, that it deploys in a containerized runtime, and that it passes behavioral tests.
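The counts above follow from simple enumeration. The sketch below is only an illustration of that arithmetic, not the benchmark's own tooling: it assumes every application is migrated in every ordered direction between two distinct frameworks, which is one reading consistent with the reported totals. The names `FRAMEWORKS` and `NUM_APPS` are made up for this example.

```python
from itertools import permutations

# Figures taken from the article; the variable names are illustrative only.
FRAMEWORKS = ("Spring", "Jakarta EE", "Quarkus")
NUM_APPS = 34

# Each ordered (source, target) pair of distinct frameworks is one migration direction.
directions = list(permutations(FRAMEWORKS, 2))

variants = NUM_APPS * len(FRAMEWORKS)  # one implementation per app per framework
tasks = NUM_APPS * len(directions)     # one task per app per direction

for source, target in directions:
    print(f"{source} -> {target}")
print(len(directions), variants, tasks)  # prints: 6 102 204
```

The direction matters because migrating from Spring to Quarkus is a different task from migrating from Quarkus to Spring, which is why three frameworks give six tasks per application rather than three.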

The results for five state-of-the-art coding agents evaluated on ScarfBench were sobering: the strongest agent achieved only a 15.3% aggregate test pass rate on focused-layer migrations and 12.2% on whole-application migrations. Across all 204 tasks, only one submission yielded a fully behaviorally equivalent target. The difficulty is asymmetric: migrations between Spring and Quarkus are the most tractable, while Jakarta-targeted migrations are the hardest. Analysis of failure traces, using an LLM as judge together with expert adjudication, revealed a taxonomy of recurring failure categories spanning the build, deploy, and test stages. The benchmark, harness, and agent traces are publicly released, providing a rigorous test for future cross-framework migration tools.
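To make the build, deploy, and test stages concrete, here is a minimal sketch of a staged oracle. It is not the released ScarfBench harness: `OracleResult`, `evaluate`, and the stub callables are hypothetical. The article does not say how the aggregate pass rate is computed, so the sketch assumes it is total tests passed divided by total tests, with a failed build or deployment counting as zero tests passed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OracleResult:
    failed_stage: Optional[str]  # "build", "deploy", "test", or None if all stages passed
    tests_passed: int
    tests_total: int

    @property
    def behaviorally_equivalent(self) -> bool:
        # Assumption: equivalence means every stage passed and every test passed.
        return self.failed_stage is None and self.tests_passed == self.tests_total


def evaluate(build: Callable[[], bool],
             deploy: Callable[[], bool],
             run_tests: Callable[[], int],
             tests_total: int) -> OracleResult:
    """Run the stages in order and stop at the first one that fails."""
    if not build():
        return OracleResult("build", 0, tests_total)
    if not deploy():
        return OracleResult("deploy", 0, tests_total)
    passed = run_tests()
    stage = None if passed == tests_total else "test"
    return OracleResult(stage, passed, tests_total)


# Stub stages standing in for a real compiler, container runtime, and test suite.
results = [
    evaluate(lambda: False, lambda: True, lambda: 0, 10),   # does not compile
    evaluate(lambda: True, lambda: False, lambda: 0, 10),   # compiles, fails to deploy
    evaluate(lambda: True, lambda: True, lambda: 7, 10),    # deploys, 7 of 10 tests pass
    evaluate(lambda: True, lambda: True, lambda: 10, 10),   # fully equivalent
]

stages = [r.failed_stage for r in results]
aggregate = sum(r.tests_passed for r in results) / sum(r.tests_total for r in results)
print(stages)     # ['build', 'deploy', 'test', None]
print(aggregate)  # 0.425
```

Under this assumed scoring, a submission can raise the aggregate pass rate without being behaviorally equivalent, which would be consistent with a 15.3% pass rate sitting alongside a single fully equivalent migration.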

Key Points
  • 34 enterprise Java applications with 102 framework variants across Spring, Jakarta EE, and Quarkus (~151K lines of code)
  • 204 directed refactoring tasks; the best agent achieves a 15.3% aggregate test pass rate on focused-layer migrations and 12.2% on whole-application migrations
  • Only 1 of 204 tasks produced a fully behaviorally equivalent migration; Jakarta-targeted migrations are the hardest

Why It Matters

ScarfBench sets a rigorous standard for measuring how well AI agents can modernize legacy enterprise Java applications, a massive industry need.