Research & Papers

Microsoft Research finds LLMs degrade documents 19-34% over 20 delegated edits

LLMs can corrupt your documents when they are handed multi-step tasks without human checks.

Deep Dive

Microsoft Research's recent paper, "LLMs Corrupt Your Documents When You Delegate," has sparked discussion about AI reliability in delegated workflows. Using DELEGATE-52, a benchmark designed as a stress test for long-horizon tasks, the study found that frontier models lose fidelity cumulatively over repeated edits. Across 20 delegated iterations, artifact fidelity dropped by 19-34%, while Python workflows proved far more robust, degrading by less than 1%. The errors are semantic rather than stylistic, and they stem from sparse but consequential mistakes in multi-step transformation and inversion tasks. The researchers stress that the benchmark is a diagnostic tool for examining delegation patterns, not a measure of overall model capability or user satisfaction.
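To put the cumulative figures in per-edit terms, here is a rough back-of-the-envelope sketch. It assumes that fidelity decays by the same fraction at every iteration, which is a simplification introduced here and not something the paper claims (the reported errors are sparse rather than uniform). The function name `per_step_retention` is hypothetical.

```python
def per_step_retention(total_drop: float, steps: int) -> float:
    """Return the constant per-step retention r such that r**steps == 1 - total_drop.

    Assumption (not from the paper): degradation compounds uniformly per step.
    """
    if not 0.0 <= total_drop < 1.0:
        raise ValueError("total_drop must be in [0, 1)")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return (1.0 - total_drop) ** (1.0 / steps)


# Cumulative drops quoted in the article: 19%, 34%, and under 1% (Python workflows).
for drop in (0.19, 0.34, 0.01):
    r = per_step_retention(drop, 20)
    print(f"{drop:.0%} total over 20 steps -> about {1 - r:.2%} lost per step")
```

Under that assumption, a 19-34% cumulative drop corresponds to roughly 1% to 2% of fidelity lost per iteration, which illustrates how small per-edit losses can add up over a long delegation chain.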

However, the paper's findings come with important caveats. Production systems already mitigate these failures through verification loops, orchestration, domain-specific tooling, and human oversight. The study did not evaluate task completion or real-world workflows, but rather focused on a controlled setting with limited human intervention. The authors view these results as identifying open challenges for long-horizon reliability, not as evidence that AI lacks practical value today. They expect continued improvements in model training, memory systems, and agentic harnesses to further reduce these failure modes over time.
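As an illustration of what a simple verification loop might look like, the sketch below runs a transformation and its inverse, then compares the restored text with the original. This is not the paper's method or metric: the function names are hypothetical, and character-level similarity from Python's `difflib` is only a stand-in, since the errors described above are semantic and would need a domain-specific check in practice.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple

Transform = Callable[[str], str]


def round_trip_fidelity(original: str, forward: Transform, inverse: Transform) -> float:
    """Apply forward then inverse and return text similarity to the original (0.0 to 1.0)."""
    restored = inverse(forward(original))
    return SequenceMatcher(None, original, restored).ratio()


def verified_round_trip(
    original: str,
    forward: Transform,
    inverse: Transform,
    threshold: float = 0.99,  # hypothetical cutoff; tune per workflow
) -> Tuple[bool, float]:
    """Return (passed, score); callers would escalate to a human when passed is False."""
    score = round_trip_fidelity(original, forward, inverse)
    return score >= threshold, score


# A lossless pair passes; a transform that silently drops content is flagged.
ok, score = verified_round_trip("quarterly report", str.upper, str.lower)
lossy_ok, lossy_score = verified_round_trip("abcdefghij", lambda s: s[:5], lambda s: s)
print(ok, lossy_ok)
```

In a real delegated workflow, `forward` and `inverse` would be model calls, and a failed check would trigger a retry or a human review rather than letting the error carry into the next iteration.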

Key Points
  • Frontier LLMs show 19–34% degradation in artifact fidelity over 20 delegated iterations on the DELEGATE-52 benchmark.
  • Python workflows showed less than 1% degradation, indicating stronger robustness under extended delegated interactions.
  • The benchmark is a diagnostic stress test for long-horizon delegation, not a measure of real-world AI deployment reliability.

Why It Matters

The findings highlight a critical reliability gap in long-horizon AI delegation and point to the need for better verification and workflow-aware training in enterprise use.