AI Safety

5 Hypotheses for Why Models Fail on Long Tasks

New analysis reveals why GPT-4 and Claude struggle with tasks that take hours rather than minutes.

Deep Dive

A new analysis from LawrenceC at METR (Model Evaluation and Threat Research) identifies five key reasons why current AI models like GPT-4 and Claude consistently underperform on longer-duration tasks compared to humans. The research explains the phenomenon behind METR's time horizon results, where model capability is measured by the length of tasks they can successfully complete. While the obvious explanation is training data bias—models see more short examples than long ones—the analysis focuses on mechanistic reasons why longer tasks remain genuinely harder for deployed systems.

First, longer tasks require subjective judgment and 'taste' that models lack, particularly for software design and stakeholder communication, where success criteria aren't easily scored. Second, extended tasks often demand narrow procedural expertise in fields like cryptography or ML that may fall outside a model's core knowledge. Third, stochastic failures compound: even a 1% error rate per step leaves only about a 37% chance of completing a 100-step task without a mistake, and the odds fall exponentially as tasks grow longer. These insights help explain why current models handle 1-minute coding functions well but struggle with 8-hour software projects.
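The compounding-failure arithmetic can be sketched in a few lines of Python. This is a simplification for illustration, not part of the METR analysis itself: it assumes each step fails independently with a fixed probability and that there is no recovery from errors.

```python
def task_success_prob(n_steps: int, per_step_error: float) -> float:
    """Probability of finishing an n-step task with no failures,
    assuming independent per-step errors and no recovery."""
    return (1.0 - per_step_error) ** n_steps

for n in (10, 100, 1000):
    # 10 steps: 90.44%, 100 steps: 36.60%, 1000 steps: 0.00%
    print(f"{n:>4} steps: {task_success_prob(n, 0.01):.2%}")
```

Under this toy model, a 1% per-step error rate is barely noticeable on short tasks but collapses success rates on long ones, which matches the qualitative gap between 1-minute functions and 8-hour projects.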

Key Points
  • Longer tasks require subjective 'taste' and judgment that current models lack, particularly for software design and communication
  • Extended duration work often needs narrow procedural expertise (like cryptography) that may be outside a model's training distribution
  • Compounding error rates make long tasks exponentially harder: a 1% per-step failure rate leaves only about a 37% chance of completing a 100-step task, and the chance falls toward zero as step counts grow

Why It Matters

Understanding these limitations is crucial for forecasting AI capabilities and developing models that can handle real-world professional work requiring sustained focus.