[D] How are you actually using AI in your research workflow these days?
New benchmark shows AI can now complete half of complex, hours-long research coding tasks.
Anthropic's latest Claude Opus 4.6 model has hit a notable milestone: a 50% success rate on METR's recently updated task-horizon benchmark. The benchmark measures AI performance on expert-level, multi-hour tasks that mirror real research work, such as 'fixing a complex bug in an ML research codebase.' In other words, advanced models can now complete about half of these substantial, time-intensive challenges. The error bands are still wide and the benchmark is far from saturated, but the trend marks a concrete shift in practical utility: researchers are actively discussing which complex tasks they can reliably delegate to AI assistants and where the technology still falls short in professional workflows.
- Claude Opus 4.6 scores 50% on METR's task-horizon benchmark of multi-hour expert tasks
- Benchmark includes complex challenges like fixing bugs in ML research codebases
- Performance indicates a shift in what substantive work can be delegated to AI assistants
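For context on what "task horizon" usually means in METR's framing: the headline metric is typically the task duration at which a model's fitted success probability crosses 50%, estimated by fitting a curve over log task length. Below is a minimal sketch of that idea; the data points and the exact logistic fit are assumptions for illustration, not METR's published methodology or results:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (task length, success rate) pairs -- NOT real METR data.
task_minutes = np.array([2, 8, 15, 30, 60, 120, 240, 480])
success_rate = np.array([0.98, 0.95, 0.90, 0.80, 0.65, 0.50, 0.30, 0.15])

def logistic(log_t, log_h50, slope):
    # Success probability declines with log task length;
    # log_h50 is the log-duration where it crosses 50%.
    return 1.0 / (1.0 + np.exp(slope * (log_t - log_h50)))

params, _ = curve_fit(logistic, np.log(task_minutes), success_rate,
                      p0=[np.log(60.0), 1.0])
print(f"Estimated 50% time horizon: ~{np.exp(params[0]):.0f} minutes")
```

The reason for fitting in log duration is that doubling a task's length tends to cost a roughly constant amount of reliability, whether you go from 10 to 20 minutes or from 2 to 4 hours.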
Why It Matters
AI is moving from simple assistance to handling substantive, hours-long expert tasks, changing how researchers decide what work to delegate.