METR AI graph debunked: study plagued by severe methodological errors

r/MachineLearning May 26, 2026

⚡Shocking flaws found in the most-cited AI timeline benchmark—human data guesstimated, not measured.

Deep Dive

Nathan Witkin, a research writer at NYU Stern’s Tech and Society Lab, has published a devastating critique of the influential METR AI time horizons graph in his Substack Transformer. The graph, widely cited as evidence of rapid AI progress, relies on METR’s Long Tasks benchmark. Witkin argues that the benchmark is so riddled with compounding errors that no meaningful conclusions can be drawn from it. Among the most damning flaws: human baseline times were often guesstimated rather than empirically measured, and where they were measured, the human benchmarkers were paid hourly, which gave them an incentive to work more slowly and directly confounds the results. The human testers were recruited from among METR employees’ friends and acquaintances, introducing severe selection bias. In addition, humans already familiar with a codebase completed tasks 5–18x faster, yet METR used the data from slower, unfamiliar workers.
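To make the familiarity point concrete, here is a minimal Python sketch of the arithmetic. The task names and baseline times are invented for illustration and do not come from METR's data. The only figures taken from the critique are the 5x and 18x speedup bounds, and applying them uniformly to every task is a simplifying assumption.

```python
# Hypothetical baseline times (minutes) recorded from workers unfamiliar
# with the codebase. These numbers are invented for illustration only.
unfamiliar_minutes = {"task_a": 30.0, "task_b": 120.0, "task_c": 480.0}

# Familiarity speedup range reported in the critique (5x to 18x).
SPEEDUP_LOW, SPEEDUP_HIGH = 5.0, 18.0


def familiar_range(minutes: float) -> tuple[float, float]:
    """Return (fastest, slowest) plausible time for a familiar worker.

    Assumes the reported speedup applies uniformly to the whole task,
    which is a simplification made for this sketch.
    """
    return minutes / SPEEDUP_HIGH, minutes / SPEEDUP_LOW


for task, minutes in unfamiliar_minutes.items():
    fastest, slowest = familiar_range(minutes)
    print(
        f"{task}: unfamiliar {minutes:.0f} min -> "
        f"familiar {fastest:.1f} to {slowest:.1f} min"
    )
```

Under these assumptions, a task logged at 120 minutes for an unfamiliar worker would take a familiar one roughly 6.7 to 24 minutes, so the choice of baseline population can shift the human side of the comparison by an order of magnitude.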

Further errors include contamination between test and training data: many of the tasks had solutions published online that likely appeared in LLM training sets. Witkin notes that these problems likely compound in unpredictable ways. He calls on the field to abandon the graph and seek higher-quality evidence, and he criticizes the broader AI research habit of over-indexing on anecdotes from power users and on compromised benchmarks. The critique echoes earlier work by Gary Marcus and Ernest Davis, who highlighted additional errors. Witkin’s post underscores why rigorous peer review and scientific standards are essential to keep policymakers and researchers from relying on superficially scientific but deeply flawed information.

Key Points

Human baseline speeds were often guesstimated, not empirically measured, undermining the entire comparison.
Human benchmarkers were paid hourly, creating a financial incentive to work slowly rather than efficiently.
Humans already familiar with a codebase completed tasks 5–18x faster, but METR used data from unfamiliar workers, biasing results.

Why It Matters

The critique undermines one of the most-cited benchmarks for AI progress, which may be misleading policymakers and researchers about AI timelines.
