Developer Tools

Quantifying the Expectation-Realisation Gap for Agentic AI Systems

Research shows that developers who expected a 24% speedup from AI tools instead experienced a 19% slowdown.

Deep Dive

A new research paper titled 'Quantifying the Expectation-Realisation Gap for Agentic AI Systems' by Sebastian Lobentanzer provides rigorous empirical evidence that agentic AI systems consistently underperform relative to pre-deployment expectations. The study reviews controlled trials across software engineering, clinical documentation, and clinical decision support, revealing systematic discrepancies between vendor claims and measured outcomes. In software development, the gap was particularly stark: experienced developers expected a 24% speedup from AI tools but were actually slowed by 19%, resulting in a 43 percentage-point calibration error. This challenges the prevailing narrative of immediate, transformative productivity gains from AI agents.
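The 43-point figure is simply the distance between the expected and realised effects: a +24% expected speedup versus a -19% realised change. A minimal sketch of that arithmetic in Python (the variable names are illustrative, not taken from the paper):

```python
# Expectation-realisation gap for the software engineering trial.
# Convention: a positive effect is a speedup, a negative effect a slowdown.
expected_effect = 0.24    # developers expected a 24% speedup
realised_effect = -0.19   # measured outcome: a 19% slowdown

# Calibration error: distance between expectation and reality,
# expressed in percentage points.
gap_pp = (expected_effect - realised_effect) * 100
print(f"Expectation-realisation gap: {gap_pp:.0f} percentage points")  # -> 43
```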

The research identifies several key drivers of this performance gap: workflow integration friction, verification burden, measurement construct mismatches, and systematic heterogeneity in treatment effects. In clinical documentation, vendor claims of multi-minute time savings per note contrast with measured reductions of less than one minute, and one widely deployed tool showed no statistically significant effect at all. For clinical decision support systems, externally validated performance falls substantially below developer-reported metrics. The evidence suggests that current deployment practices routinely overlook human oversight costs and integration challenges, motivating structured planning frameworks that state explicit, quantified benefit expectations and account for these costs up front.
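To make that accounting concrete, the sketch below nets a vendor's claimed per-task saving against the oversight and integration costs the study highlights. The function, cost categories, and example numbers are hypothetical illustrations, not a formula from the paper:

```python
def net_benefit_per_task(claimed_saving_min: float,
                         verification_min: float,
                         integration_friction_min: float) -> float:
    """Hypothetical planning check: claimed per-task time saving minus
    the human oversight costs that deployments often ignore.
    All inputs are minutes per task; a negative result means the tool
    is expected to cost time rather than save it."""
    return claimed_saving_min - verification_min - integration_friction_min

# Illustrative scenario: a vendor claims 3 minutes saved per clinical note,
# but reviewing and correcting the draft takes 2.5 minutes and workflow
# friction adds another minute.
print(net_benefit_per_task(3.0, 2.5, 1.0))  # -> -0.5 (a net slowdown)
```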

Key Points
  • Experienced developers expecting 24% speedup from AI tools were actually slowed by 19%—a 43 percentage-point gap
  • Clinical documentation tools showed minimal time savings (<1 minute per note), with one widely deployed tool having no statistically significant effect
  • Performance gaps driven by workflow friction, verification burden, and measurement mismatches between vendor claims and real-world use

Why It Matters

This research provides crucial data for realistic AI deployment planning and challenges inflated vendor claims about productivity gains.