METR AI graph debunked: study plagued by severe methodological errors
Shocking flaws found in the most-cited AI timeline benchmark: human baseline data was often guesstimated, not measured.
Nathan Witkin, a research writer at NYU Stern’s Tech and Society Lab, has published a devastating critique of the influential METR AI time horizons graph in his Substack Transformer. The graph, widely cited as evidence for rapid AI progress, relies on METR’s Long Tasks benchmark. Witkin argues that the benchmark is so riddled with compounding errors that no meaningful conclusions can be drawn from it. Among the most damning flaws, human baseline data was often guesstimated rather than empirically measured. When it was measured, human benchmarkers were paid hourly, which gave them an incentive to work more slowly and is a direct confound. The sample of human testers was drawn from METR employees’ friends and acquaintances, introducing severe selection bias. Additionally, humans who were already familiar with a codebase completed tasks 5–18x faster, but METR used data from slower, unfamiliar workers.
Further errors include contamination between test and training data: many of the tasks had published solutions online that likely appeared in LLM training sets. Witkin notes that these problems likely compound in unpredictable ways. He calls for the field to abandon the graph and pursue higher-quality information, criticizing the broader AI research practice of overindexing on anecdotal data from power users and on compromised benchmarks. The critique echoes earlier work by Gary Marcus and Ernest Davis, who highlighted additional errors. Witkin’s post underscores why rigorous peer review and scientific standards are essential to prevent policymakers and researchers from relying on superficially scientific but deeply flawed information.
- Human baseline speeds were often guesstimated, not empirically measured, undermining the entire comparison.
- Human benchmarkers were paid hourly, creating a financial incentive to work slowly rather than efficiently.
- Familiarity with a codebase gave humans a 5–18x speed advantage, but METR used unfamiliar workers, biasing results.
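To get a feel for the size of the familiarity effect, here is a minimal sketch of the arithmetic. The 90-minute baseline is a made-up number chosen for illustration, not a figure from METR or from Witkin's post, and the helper `familiar_time_range` is hypothetical. It also assumes the reported 5–18x speedup applies uniformly to a task, which is a simplification.

```python
def familiar_time_range(unfamiliar_minutes, min_speedup=5.0, max_speedup=18.0):
    """Return (fastest, slowest) completion times, in minutes, implied for a
    worker who already knows the codebase, given a time measured on an
    unfamiliar worker and the 5-18x speedup range cited in the critique."""
    fastest = unfamiliar_minutes / max_speedup
    slowest = unfamiliar_minutes / min_speedup
    return fastest, slowest


# Hypothetical example: a task that took an unfamiliar worker 90 minutes.
fastest, slowest = familiar_time_range(90.0)
print(f"Familiar worker: roughly {fastest:.0f} to {slowest:.0f} minutes")
```

Under these assumptions, a task recorded as a 90-minute human task would look more like a 5- to 18-minute task for someone who already knew the code, which is why the choice of baseline worker matters so much to any human-versus-AI comparison.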
Why It Matters
The critique undermines one of the most cited benchmarks for AI progress, which may be misleading policymakers and researchers about timelines.