Multi-Agent Systems for Root Cause Analysis in Microservices
Tree-structured search guided by reflection scores replaces linear diagnostics, with high accuracy on a benchmark system and lower accuracy in production.
Root cause analysis (RCA) in microservice systems is notoriously difficult because the relevant logs and metrics are scattered across many services. A new paper from researchers at the University of Oulu proposes LATS-RCA, a multi-agent framework that uses large language models to automate RCA. Unlike prior linear approaches, LATS-RCA formulates RCA as a reflection-guided, tree-structured search built on a Language Agent Tree Search algorithm. Multiple LLM-driven agents each analyze the execution logs and performance metrics of a specific microservice and collect evidence. Reflection scores derived from intermediate diagnostic states steer the search toward the most probable root cause, allowing a more thorough exploration of fault paths.
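To make the idea of a reflection-guided tree search concrete, here is a minimal sketch in Python. It is not the paper's implementation: the service names, the dependency map, the `evidence` scores, and the `reflect` function are all hypothetical stand-ins, and a real system would replace `reflect` with an LLM agent that reads a service's logs and metrics. The sketch only shows the general shape of the loop: select a promising branch, expand it, score the new diagnostic states, and propagate those scores back up the tree.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One diagnostic state: the service examined at this step of a fault path."""
    service: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)
    visits: int = 0
    value: float = 0.0  # running mean of reflection scores seen through this node

    def path(self) -> List[str]:
        """Services from the entry point down to this node."""
        out, node = [], self
        while node is not None:
            out.append(node.service)
            node = node.parent
        return out[::-1]


def uct(node: Node, c: float) -> float:
    """Upper-confidence score: mean reflection value plus an exploration bonus."""
    return node.value + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def tree_search_rca(
    deps: Dict[str, List[str]],
    entry: str,
    reflect: Callable[[List[str]], float],
    iterations: int = 10,
    c: float = 1.0,
) -> Tuple[List[str], float]:
    """Return the highest-scoring fault path found and its reflection score."""
    root = Node(service=entry)
    best_path, best_score = root.path(), reflect(root.path())
    root.visits, root.value = 1, best_score

    for _ in range(iterations):
        # Selection: walk down the tree, always taking the child with the best UCT.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, c))

        # Expansion: add one child per downstream dependency not already on the path.
        for nxt in deps.get(node.service, []):
            if nxt not in node.path():
                node.children.append(Node(service=nxt, parent=node))

        # Evaluation and backpropagation: score each new state (or re-score a
        # dead end) and update the running means along the path to the root.
        for target in (node.children or [node]):
            score = reflect(target.path())
            if score > best_score:
                best_path, best_score = target.path(), score
            cur = target
            while cur is not None:
                cur.visits += 1
                cur.value += (score - cur.value) / cur.visits
                cur = cur.parent

    return best_path, best_score


# Hypothetical example data: a tiny call graph and made-up evidence scores.
deps = {"gateway": ["auth", "orders"], "auth": ["token-db"], "orders": ["inventory"]}
evidence = {"gateway": 0.2, "auth": 0.6, "orders": 0.1, "token-db": 0.9, "inventory": 0.1}


def reflect(path: List[str]) -> float:
    """Stand-in for an LLM agent's reflection score on the last service in the path."""
    return evidence[path[-1]]


print(tree_search_rca(deps, "gateway", reflect))
```

With these made-up scores, the search settles on the path from gateway through auth to token-db. The point of the tree structure is that a weak branch such as orders is still scored and can be revisited, rather than being dropped after a single linear pass.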
The framework was evaluated in two environments: Light-OAuth2 (LO2), a small-team Java system with a homogeneous tech stack, and a more complex production environment (Prod) from a case company. Results on LO2 show high diagnostic accuracy, while Prod shows lower accuracy and higher computational costs. The production deployment highlights real-world challenges such as polyglot technology stacks, inconsistent logging practices, and multi-factor root causes. The study demonstrates LATS-RCA's practical viability and sets a benchmark for future LLM-driven RCA tools in complex microservice architectures.
- LATS-RCA uses a Language Agent Tree Search algorithm in which multiple LLM agents, each assigned to a specific microservice, explore candidate root cause paths guided by reflection scores.
- The framework was evaluated on Light-OAuth2 (a small-team Java system) and a production environment; it achieved high accuracy on LO2, while the production system showed lower accuracy and higher computational costs.
- The production deployment encountered polyglot tech stacks and inconsistent logging practices, which lowered accuracy relative to the homogeneous LO2 benchmark.
Why It Matters
LATS-RCA automates complex root cause analysis in microservices, which could cut debugging time for DevOps teams working in production environments.