METR evaluated an early version of Claude Mythos
AI risk evaluator METR finds Claude Mythos hits ceiling of current tests with 16-hour task horizon
METR (Model Evaluation and Threat Research) assessed an early version of Claude Mythos Preview during a limited window in March 2026, using its time-horizon methodology. The metric estimates the length of task, measured by how long a human expert would take, that an AI model can complete autonomously at a given success rate. The results place Claude Mythos at the very top of METR's current measurement scale: a 50%-time-horizon of at least 16 hours (95% CI 8.5–55 hours). In other words, the model is estimated to succeed about half the time on tasks that take an expert human roughly 16 hours, and the wide confidence interval reflects substantial statistical uncertainty.
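To make the idea of a 50%-time-horizon concrete, here is a minimal sketch of how such a number could be derived. It assumes that success probability follows a logistic curve in the logarithm of human task time, and it is not METR's actual code. The `RESULTS` data and the names `fit_logistic` and `time_horizon` are hypothetical, invented purely for illustration.

```python
import math

# Hypothetical (human_minutes, succeeded) results, invented for illustration only.
# They are NOT METR data.
RESULTS = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
    (60, 0), (60, 1), (120, 1), (240, 0), (240, 1),
    (480, 1), (480, 0), (960, 0), (960, 1), (1920, 0), (3840, 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(results, iterations=25):
    """Fit P(success) = sigmoid(b0 + b1 * log2(minutes)) with Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for minutes, success in results:
            x = math.log2(minutes)
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            # Gradient of the log-likelihood.
            g0 += success - p
            g1 += (success - p) * x
            # Entries of the (negated) Hessian.
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 linear system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def time_horizon(b0, b1, success_rate=0.5):
    """Task length (minutes) where the fitted success probability equals success_rate."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2.0 ** ((logit - b0) / b1)


b0, b1 = fit_logistic(RESULTS)
horizon_minutes = time_horizon(b0, b1)
print(f"50% time horizon: about {horizon_minutes / 60:.1f} hours")
```

The horizon is simply the task length at which the fitted curve crosses the chosen success rate, so asking for a stricter rate, for example `time_horizon(b0, b1, 0.8)`, yields a shorter horizon from the same fit.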
However, METR acknowledges significant limitations. Its task suite contains 228 items, but only 5 are estimated to require 16 or more hours of sustained work. This sparse coverage makes precise quantitative comparisons unreliable, because measurements in this range become unstable and less meaningful. METR explicitly warns against extrapolating from these numbers or using them for direct comparisons with other models. It notes that the suite could still distinguish an even more capable model from the current state of the art, and that it is actively developing updated methods with longer tasks to improve robustness. Until then, it advises caution in interpreting recent time-horizon numbers.
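A rough way to see why so few long tasks is a problem is to look at how sampling noise shrinks with the number of tasks. The sketch below uses the 5 and 228 figures from the text. The binomial standard error is a deliberate simplification and not METR's uncertainty method, and the comparison count of 50 tasks is a hypothetical chosen only for contrast.

```python
import math


def binomial_standard_error(n_tasks, success_rate=0.5):
    """Standard error of a success rate estimated from n_tasks independent tasks."""
    return math.sqrt(success_rate * (1.0 - success_rate) / n_tasks)


# Share of the suite that is 16+ hours long (figures from the text).
share_long = 5 / 228

# Noise on an observed ~50% success rate with 5 tasks versus a hypothetical 50.
se_5 = binomial_standard_error(5)
se_50 = binomial_standard_error(50)

print(f"Long tasks make up {share_long:.1%} of the suite")
print(f"Standard error with 5 tasks:  +/- {se_5:.2f}")
print(f"Standard error with 50 tasks: +/- {se_50:.2f}")
```

Under these simplifying assumptions, a success rate measured on 5 tasks carries roughly three times the noise of one measured on 50, which is consistent with METR's caution about reading too much into numbers at this range.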
- Estimated 50%-time-horizon of at least 16 hours (95% CI 8.5–55 hours), at the upper limit of METR's current measurement capability.
- Only 5 out of 228 tasks in the suite are 16+ hours long, making measurements unstable and less meaningful at that range.
- METR advises caution in precise quantitative comparisons and is developing longer tasks for more robust future evaluations.
Why It Matters
As AI capabilities advance, evaluation methods must keep up; current tests may already be inadequate for frontier models like Claude Mythos.