From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures

A new AI system uses graph embeddings to detect anomalies in live streaming services before they cause major incidents.

Deep Dive

A team of six Amazon researchers has developed an AI system that uses graph neural networks to detect anomalies in Prime Video's microservice architecture. The system, detailed in a paper accepted at FSE 2026, uses unsupervised node-level graph embeddings to learn structural representations from directed, weighted service graphs at minute-level resolution. By comparing embeddings from load tests with those from actual live events, it can identify services that behave differently under real-world traffic, a class of problem that traditional load testing often misses.
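To make the idea of a directed, weighted service graph at minute-level resolution concrete, here is a minimal sketch. It assumes call records of the form (minute, caller, callee, request count); the record format, the service names, and the `build_minute_graphs` helper are all hypothetical and do not come from the paper.

```python
from collections import defaultdict

import numpy as np

# Hypothetical call records: (minute, caller, callee, request_count).
CALLS = [
    (0, "playback", "auth", 120),
    (0, "playback", "catalog", 80),
    (0, "catalog", "metadata", 40),
    (1, "playback", "auth", 150),
    (1, "catalog", "metadata", 10),
]


def build_minute_graphs(calls):
    """Return (services, {minute: adjacency}).

    adjacency[i, j] is the total request count from services[i] to
    services[j] in that minute, so direction and weight are both kept.
    """
    services = sorted({s for _, src, dst, _ in calls for s in (src, dst)})
    index = {name: i for i, name in enumerate(services)}
    size = len(services)
    graphs = defaultdict(lambda: np.zeros((size, size)))
    for minute, src, dst, count in calls:
        graphs[minute][index[src], index[dst]] += count
    return services, dict(graphs)


services, graphs = build_minute_graphs(CALLS)
```

Each per-minute adjacency matrix could then serve as input to a graph embedding model. The paper's actual edge weights and node features may differ from this request-count assumption.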

Built on a GCN-GAE (Graph Convolutional Network, Graph Autoencoder) framework, the system flags anomalies based on the cosine similarity between embeddings. During evaluation, it achieved 96% precision and a 0.08% false positive rate, though recall was limited to 58% under conservative propagation assumptions. The researchers also introduced a synthetic anomaly injection framework for controlled testing, and they report that the system detected incident-related services that were later documented in post-mortem analyses.
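The cosine-similarity comparison can be sketched as follows. This is an illustration only: the embeddings are hand-written stand-ins for GCN-GAE output, and the function names and the 0.8 threshold are assumptions rather than values from the paper.

```python
import numpy as np


def cosine_similarity_rows(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (n_services, dim) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # eps guards against division by zero for all-zero embeddings.
    return dots / np.maximum(norms, eps)


def flag_anomalies(services, load_emb, live_emb, threshold=0.8):
    """Return services whose live embedding drifted from the load-test one.

    The 0.8 threshold is an illustrative assumption, not a paper value.
    """
    sims = cosine_similarity_rows(load_emb, live_emb)
    return [name for name, sim in zip(services, sims) if sim < threshold]


# Hypothetical 2-dimensional embeddings, one row per service.
services = ["auth", "catalog", "playback"]
load_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
live_emb = np.array([[0.9, 0.1], [1.0, 0.0], [1.0, 1.0]])

flagged = flag_anomalies(services, load_emb, live_emb)
```

In this toy data only "catalog" points in a different direction during the live event, so it is the only service flagged. A lower threshold would presumably trade recall for precision, in line with the conservative behavior described in the evaluation.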

The system has already been used within Prime Video, providing early detection during high-traffic events such as Thursday Night Football streams and major video-on-demand premieres. Although it was developed for Amazon's specific needs, the methodology could be applied to other microservice ecosystems, where it would complement traditional monitoring tools with a structural view of how services interact.

Key Points
  • Uses GCN-GAE graph embeddings to compare load test vs. live traffic patterns with 96% precision
  • Achieves 0.08% false positive rate while detecting anomalies in minute-level service graphs
  • Successfully identified incident-related services during Prime Video's Thursday Night Football and Rings of Power events
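The precision, false positive rate, and recall figures follow the standard confusion-matrix definitions. The sketch below uses made-up counts, not the paper's data, to show how a detector can combine high precision and a very low false positive rate with modest recall when anomalous services are rare.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute precision, recall, and false positive rate from raw counts."""
    return {
        "precision": tp / (tp + fp),            # flagged services that were truly anomalous
        "recall": tp / (tp + fn),               # truly anomalous services that were flagged
        "false_positive_rate": fp / (fp + tn),  # normal services that were wrongly flagged
    }


# Hypothetical counts chosen only to illustrate the pattern.
metrics = detection_metrics(tp=9, fp=1, fn=6, tn=984)
```

Because normal services vastly outnumber anomalous ones in these counts, a single false alarm barely moves the false positive rate, while the six missed anomalies pull recall down to 60%.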

Why It Matters

Detecting subtle anomalies that traditional load testing misses could make streaming services more reliable and help prevent outages during major events.