Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis
Researchers' new system automates root cause diagnosis in complex microservices, cutting resolution times for SRE teams.
A research team from Boston University and other institutions has introduced Praxium, a novel AI framework designed to tackle the growing problem of diagnosing performance anomalies in complex cloud microservice architectures. The system addresses a critical pain point: as organizations adopt continuous integration/deployment (CI/CD) pipelines, traditional manual diagnosis by experts becomes unscalable. Praxium works by continuously monitoring telemetry data and correlating it with dependency installation information from a companion software discovery tool called PraxiPaaS.
When an anomaly is detected, Praxium performs causal impact analysis to determine whether recent software installations or rollouts are the root cause. The researchers demonstrated impressive results across 75 trials using four different types of synthetic anomalies, with the system consistently achieving a macro-F1 score greater than 0.97 for detection accuracy. Crucially, the causal analysis component reliably identified the correct root cause even as package installations occurred at increasingly shorter intervals, mimicking real-world deployment frequencies.
The framework represents a significant shift from reactive, expert-dependent troubleshooting to automated, data-driven diagnosis. By providing site reliability engineers (SREs) with precise information about which installation caused an issue, Praxium could dramatically reduce mean time to resolution (MTTR) for cloud incidents. The paper also includes practical guidance on anomaly detection hyperparameter tuning, making the approach more accessible for production implementation in enterprise environments.
- Achieves >0.97 macro-F1 score for anomaly detection across 75 trials with four synthetic anomaly types
- Integrates with PraxiPaaS software discovery tool to analyze dependencies and installation impacts
- Reliably identifies root causes via causal analysis even with frequent package installations
Why It Matters
Automates critical SRE workflows, potentially slashing incident resolution times in complex cloud environments where manual diagnosis fails.