CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload
A new Transformer model analyzes 206 system metrics sampled every second to forecast VM performance degradation in public clouds.
A team from EPFL and Universidad Politécnica de Madrid has introduced CloudFormer, an AI model designed to address a critical problem in public cloud infrastructure: unpredictable performance degradation. In multi-tenant environments, virtual machines (VMs) compete for shared physical resources such as last-level cache and memory bandwidth, causing severe slowdowns that are hard to predict. CloudFormer uses a dual-branch Transformer architecture (a type of neural network that uses attention to relate the elements of a sequence to one another) to jointly model the temporal dynamics of workloads and complex system-level interactions. It processes a fine-grained dataset of 206 system metrics sampled every second, allowing it to capture transient interference effects that other models miss.
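To make the dual-branch idea concrete, here is a minimal sketch in plain Python. It is not CloudFormer's actual architecture, which this summary does not detail. It assumes one branch attends across time steps and the other across metrics, then concatenates the pooled results. The function names, the toy 4x3 window, and the use of attention without learned weights are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    A real Transformer learns query/key/value weights; they are omitted
    here (assumption) so the sketch stays short and deterministic.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all tokens.
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


def mean_pool(tokens):
    """Average a list of equal-length vectors into a single vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def dual_branch_features(window):
    """window: T time steps, each a list of M metric values."""
    # Temporal branch (assumed): every time step is a token of M metrics.
    temporal = mean_pool(self_attention(window))
    # System branch (assumed): every metric is a token of T samples.
    metric_tokens = [list(column) for column in zip(*window)]
    system = mean_pool(self_attention(metric_tokens))
    return temporal + system  # concatenated vector of length M + T


# Hypothetical window: 4 seconds of 3 normalized metrics (CloudFormer uses 206).
window = [
    [0.2, 0.5, 0.1],
    [0.4, 0.6, 0.1],
    [0.9, 0.7, 0.3],
    [0.3, 0.5, 0.2],
]
features = dual_branch_features(window)
print(len(features))
```

In a full model, a learned prediction head would map a vector like `features` to a degradation estimate; the sketch stops before that step because the summary gives no details on it.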
This high-resolution analysis enables CloudFormer to adapt to varying, previously unseen workload conditions without requiring scenario-specific tuning, addressing the 'black-box' challenge of public clouds, where the workloads running inside tenant VMs are unknown to the predictor. In the researchers' experiments, the model achieves a mean absolute error (MAE) of 7.8% in predicting performance degradation, at least a 28% improvement over current state-of-the-art methods. Alongside the model, the team is releasing a large, high-resolution dataset intended to extend the benchmarks available in this field. The work demonstrates robust generalization: the model's predictions remain accurate even for diverse and unknown workloads, a key requirement for real-world cloud management.
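For readers unfamiliar with the metric, the sketch below shows how a mean absolute error and a relative improvement are computed. The sample values are made up for illustration and are not from the paper. Reading the 28% figure as a relative reduction in MAE is an assumption, since the summary does not define it.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical degradation values in percent (not the paper's data).
actual = [10.0, 25.0, 40.0, 5.0]
predicted = [12.0, 20.0, 43.0, 5.0]
mae = mean_absolute_error(actual, predicted)  # (2 + 5 + 3 + 0) / 4 = 2.5

# Assumption: "28% improvement" means a 28% relative reduction in MAE.
# Under that reading, a 7.8% MAE implies a best-baseline MAE of about 10.8%.
implied_baseline_mae = 7.8 / (1 - 0.28)
print(mae, round(implied_baseline_mae, 1))
```

Under that assumed reading, "at least 28%" would mean the strongest competing method had an MAE of roughly 10.8% or higher on the same task.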
- Uses a dual-branch Transformer to model temporal and system-level behavior from 206 metrics sampled every second.
- Achieves a 7.8% mean absolute error, outperforming existing prediction methods by at least 28%.
- Generalizes to unknown workloads without scenario-specific tuning, addressing a key 'black-box' cloud challenge.
Why It Matters
Accurate degradation forecasts could let cloud providers and enterprises manage resources proactively, reduce costs, and better protect application performance.