Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios
AI agents now autonomously complete up to 22 of 32 attack steps, performing roughly 6 hours of human expert work in minutes.
A research team from Anthropic and other institutions has published a landmark study measuring the autonomous cyber-attack capabilities of frontier AI models. The paper, "Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios," tested seven models released between August 2024 and February 2026 on two purpose-built cyber ranges: a 32-step corporate network attack and a 7-step industrial control system (ICS) attack. These scenarios require chaining heterogeneous capabilities across extended action sequences, simulating real-world attack chains.
Key findings reveal two significant trends. First, model performance scales log-linearly with inference-time compute, with no observed plateau: increasing the budget from 10M to 100M tokens yields gains of up to 59%, and doing so requires no particular technical sophistication from the operator. Second, each successive model generation outperforms its predecessor at a fixed token budget. On the corporate network range, the average number of steps completed at 10M tokens rose from 1.7 (GPT-4o, August 2024) to 9.8 (Opus 4.6, February 2026).
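To make the scaling claim concrete, the sketch below fits a log-linear curve (steps completed ≈ a + b·log10(tokens)) through two illustrative anchor points taken from the figures quoted above: 9.8 steps at 10M tokens for the most recent model, and roughly 59% more at 100M tokens. Treating these two points as exact is an assumption for illustration, not the paper's actual fit.

```python
import math

# Log-linear scaling hypothesis: steps ≈ a + b * log10(tokens).
# Anchor points are illustrative, derived from the figures quoted in the
# article (9.8 steps at 10M tokens; up to 59% more at 100M tokens).
points = [(10_000_000, 9.8), (100_000_000, 9.8 * 1.59)]

# Fit the line through the two points in log10(tokens) space.
(x1, y1), (x2, y2) = [(math.log10(t), s) for t, s in points]
b = (y2 - y1) / (x2 - x1)  # steps gained per 10x increase in compute
a = y1 - b * x1

def predicted_steps(tokens: int) -> float:
    """Predicted attack steps completed at a given token budget."""
    return a + b * math.log10(tokens)

print(f"slope: {b:.2f} steps per 10x tokens")
print(f"at 10M tokens:  {predicted_steps(10_000_000):.1f} steps")
print(f"at 100M tokens: {predicted_steps(100_000_000):.1f} steps")
```

Under this toy fit, each tenfold increase in inference compute adds a roughly constant number of completed steps, which is what "log-linear with no plateau" implies: further budget increases keep paying off at a steady per-decade rate.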
The best single run completed 22 of 32 steps, corresponding to roughly 6 of the estimated 14 hours a human expert would need for the same task. On the industrial control system range, performance remains more limited: the most recent models are the first to complete steps reliably, averaging 1.2-1.4 of 7 steps (maximum 3). Together, these results demonstrate both rapid capability advancement and persistent challenges in specialized domains such as ICS security.
The study's methodology represents a significant advancement in AI safety evaluation, moving beyond static benchmarks to dynamic, multi-step scenarios that better reflect real-world threats. The consistent scaling laws observed suggest that without intervention, future models will continue to improve at autonomous cyber operations, raising important questions about defensive capabilities and governance frameworks for advanced AI systems.
Key Takeaways
- Performance scales log-linearly with compute: moving from 10M to 100M tokens yields gains of up to 59%, with no plateau
- Rapid generational improvement: Steps completed rose from 1.7 (GPT-4o) to 9.8 (Opus 4.6) at fixed 10M token budget
- Best single run completed 22/32 steps, automating roughly 6 of the estimated 14 hours of human expert work
Why It Matters
Demonstrates AI's rapidly advancing autonomous cyber capabilities, forcing security teams to prepare for AI-powered attacks and defenses.