Research & Papers

AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

A new multi-agent system learns from failed cloud operations and reaches 66.3% best@5 success on AIOpsLab diagnostic tasks.

Deep Dive

A research team led by Pei Yang has introduced AOI (Autonomous Operations Intelligence), a framework aimed at three barriers to deploying LLM agents in enterprise Site Reliability Engineering (SRE): restricted access to proprietary data, unsafe action execution in permission-governed environments, and the inability of closed systems to learn from failures. AOI formulates automated operations as a structured trajectory learning problem under security constraints. It combines a trainable diagnostic system with a read-write separated execution architecture that blocks unauthorized state mutations while still allowing safe learning.
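The read-write separation idea can be illustrated with a small gate that routes read-only diagnostic commands freely and blocks any mutating command that has not been explicitly approved. This is only a sketch under stated assumptions: the summary does not describe AOI's actual policy, so the `ExecutionGate` class, the `READ_ONLY_VERBS` allow-list, and the use of `kubectl` commands are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical allow-list of read-only kubectl verbs (an assumption for
# illustration; AOI's real permission policy is not given in the source).
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}


@dataclass
class ExecutionGate:
    """Routes commands: reads pass, writes need prior approval."""
    approved_writes: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def is_read_only(self, command: str) -> bool:
        parts = command.split()
        return (
            len(parts) >= 2
            and parts[0] == "kubectl"
            and parts[1] in READ_ONLY_VERBS
        )

    def submit(self, command: str) -> str:
        # This sketch only returns a routing decision; it never runs anything.
        if self.is_read_only(command):
            decision = "allow_read"
        elif command in self.approved_writes:
            decision = "allow_write"
        else:
            decision = "blocked"
        # Every decision is recorded, so blocked attempts remain available
        # as data for later analysis.
        self.audit_log.append((command, decision))
        return decision


gate = ExecutionGate(approved_writes={"kubectl rollout restart deployment/web"})
for cmd in ("kubectl get pods", "kubectl delete pod web-0"):
    print(cmd, "->", gate.submit(cmd))
```

Keeping an audit log of blocked attempts is one plausible way such a design could still support learning from failures without letting the agent mutate state.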

The technical core uses Group Relative Policy Optimization (GRPO) to distill expert knowledge into locally deployed open-source models, enabling preference-based learning without data exposure. AOI's Failure Trajectory Closed-Loop Evolver mines unsuccessful operational attempts and converts them into corrective supervision signals for continual data augmentation. On the AIOpsLab benchmark, AOI achieved 66.3% best@5 success across 86 tasks, outperforming the prior state of the art by 24.4 percentage points. A locally deployed 14B-parameter model trained with Observer GRPO reached 42.9% avg@1 on 63 held-out tasks with unseen fault types, surpassing Claude Sonnet 4.5. The Evolver converted 37 failed trajectories into diagnostic guidance, improving end-to-end avg@5 performance by 4.8 points while reducing variance by 35%.
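For readers unfamiliar with GRPO, its central step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative advantage computation in its commonly published form. The function name, the example rewards, and the choice of population standard deviation are assumptions for illustration, not details of AOI's Observer GRPO training.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std is used here; some implementations use sample std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four diagnostic trajectories on one incident:
# 1.0 means the diagnosis was judged correct, 0.0 means it was not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful trajectories receive positive advantages and failed ones negative advantages, so the policy update pushes probability toward the better samples in each group. When every sample in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.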

Key Points
  • AOI framework achieves 66.3% best@5 success on 86 cloud diagnostic tasks, beating previous SOTA by 24.4 percentage points
  • Locally deployed 14B model with GRPO training reaches 42.9% avg@1 on unseen fault types, outperforming Claude Sonnet 4.5
  • Failure Trajectory Evolver converts 37 failed trajectories into training signals, improving avg@5 by 4.8 points with 35% variance reduction
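
The best@k and avg@k figures above measure different things, and a short sketch may help when comparing them. The summary does not define either metric, so the definitions here are assumptions based on common usage: best@k counts a task as solved if any of its k runs succeeds, while avg@k averages the per-run success rate. The function names and the sample results are hypothetical.

```python
def best_at_k(results):
    """Fraction of tasks with at least one successful run."""
    return sum(any(runs) for runs in results.values()) / len(results)


def avg_at_k(results):
    """Mean per-run success rate, averaged over tasks."""
    return sum(sum(runs) / len(runs) for runs in results.values()) / len(results)


# Hypothetical outcomes: four tasks, five runs each (True = success).
results = {
    "task-1": [True, False, False, False, False],
    "task-2": [False, False, False, False, False],
    "task-3": [True, True, True, True, True],
    "task-4": [True, True, False, False, False],
}
print(f"best@5 = {best_at_k(results):.1%}, avg@5 = {avg_at_k(results):.1%}")
```

Under these definitions best@k can never be lower than avg@k on the same runs, which is why a best@5 figure and an avg@1 figure should not be compared directly.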

Why It Matters

AOI points toward secure, self-improving AI agents for cloud operations that enterprises can deploy without exposing sensitive data or risking system stability.