Research & Papers

LLMs Corrupt Your Documents When You Delegate

Frontier models like GPT-5.4 and Claude 4.6 Opus introduce silent errors that compound over time.

Deep Dive

A new research paper from Stanford and Microsoft researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville reveals a critical flaw in using Large Language Models (LLMs) for delegated knowledge work. The team created DELEGATE-52, a benchmark simulating long, complex editing workflows across 52 professional domains such as coding, crystallography, and music notation. Their large-scale experiment tested 19 LLMs, including frontier models like OpenAI's GPT-5.4, Anthropic's Claude 4.6 Opus, and Google's Gemini 3.1 Pro. The alarming finding: even these top models corrupt an average of 25% of document content by the end of an extended delegated task, and weaker models degrade even further.
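The paper's exact scoring pipeline isn't reproduced here, but a corruption rate of this kind is easy to picture. Here is a minimal sketch assuming a line-level diff against a reference document; the corruption_rate helper and the metric definition are illustrative, not the benchmark's own:

```python
import difflib

def corruption_rate(reference: str, model_output: str) -> float:
    """Illustrative line-level corruption metric (not the paper's scorer):
    the fraction of reference lines that do NOT survive unchanged in the
    model-edited document."""
    ref_lines = reference.splitlines()
    out_lines = model_output.splitlines()
    matcher = difflib.SequenceMatcher(a=ref_lines, b=out_lines, autojunk=False)
    # Count reference lines preserved verbatim in the output.
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    if not ref_lines:
        return 0.0
    return 1.0 - preserved / len(ref_lines)

# Example: two of four lines silently altered -> 50% corruption.
reference = "alpha\nbeta\ngamma\ndelta"
output = "alpha\nbetta\ngamma\ndelts"
print(f"{corruption_rate(reference, output):.0%}")  # 50%
```

A real scorer would also have to separate requested edits from unrequested ones; this sketch simply treats every deviation from the reference as corruption.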

Additional experiments showed that giving models agentic capabilities, meaning the ability to use tools and take actions, did not improve their performance on DELEGATE-52. Degradation worsens with larger documents, longer interaction chains, and the presence of distractor files. The analysis indicates that LLMs introduce sparse but severe errors that silently corrupt documents, and that these errors compound over sequential interactions. This creates a significant trust problem for professionals relying on AI for tasks like 'vibe coding' or complex document editing: the final output may contain hidden, cascading inaccuracies.
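To see why sparse errors still wreck long workflows, consider a toy compounding model (an assumption for illustration, not the paper's analysis): if each delegation round independently corrupts a fraction p of the still-intact content, the intact share after n rounds is (1 - p)^n.

```python
# Toy compounding model (illustrative assumption, not the paper's analysis):
# each round corrupts a fraction p of the still-intact content, so the
# intact fraction after n rounds is (1 - p) ** n.
def intact_after(p: float, rounds: int) -> float:
    return (1.0 - p) ** rounds

# Even a modest 3% per-round error rate leaves only about half the
# document untouched after 20 rounds: 86%, 74%, 54%.
for rounds in (5, 10, 20):
    print(rounds, f"{intact_after(0.03, rounds):.0%}")
```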

Key Points
  • DELEGATE-52 benchmark tests 19 LLMs across 52 professional editing domains, revealing systemic unreliability.
  • Even frontier models (GPT-5.4, Claude 4.6 Opus, Gemini 3.1 Pro) corrupt an average of 25% of content in long workflows.
  • Agentic tool use fails to solve the problem; errors are sparse, severe, and compound silently over time.

Why It Matters

Professionals delegating complex editing to AI risk silent, compounding errors in critical documents, undermining trust in automated workflows.