Spot-and-Scoot: Peeking Into Spot Instance Availability
A new technique probes AWS and Azure spot instance availability at near-zero cost, helping users avoid costly workload interruptions.
A team of researchers has introduced Spot-and-Scoot (SnS), a method for predicting the availability of discounted cloud computing resources known as spot instances. Spot instances can cost up to 90% less than on-demand pricing, but the cloud provider can terminate them abruptly when it needs the capacity elsewhere. The core idea of SnS is cost-efficiency: instead of running expensive instances to monitor availability, it submits spot requests and cancels each one as soon as the provider accepts it, collecting a binary 'available' or 'unavailable' signal at near-zero cost. In effect, SnS uses the cloud's provisioning lifecycle to peek at availability without paying for running instances.
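The submit-then-cancel idea can be sketched in a few lines. This is a minimal illustration, not the researchers' implementation: the `SpotClient` interface, its three method names, and the `"accepted"` / `"rejected"` / `"pending"` states are all hypothetical stand-ins for whatever a real AWS or Azure SDK exposes, and `FakeClient` exists only so the sketch runs without a cloud account.

```python
from dataclasses import dataclass
from typing import Protocol


class SpotClient(Protocol):
    """Hypothetical provider interface (an assumption, not a real SDK)."""

    def submit_spot_request(self, instance_type: str, zone: str) -> str: ...
    def request_state(self, request_id: str) -> str: ...
    def cancel_spot_request(self, request_id: str) -> None: ...


@dataclass
class ProbeResult:
    instance_type: str
    zone: str
    available: bool


def probe(client: SpotClient, instance_type: str, zone: str,
          max_polls: int = 10) -> ProbeResult:
    """Submit a spot request, record whether it is accepted, then cancel it."""
    request_id = client.submit_spot_request(instance_type, zone)
    available = False
    try:
        for _ in range(max_polls):
            state = client.request_state(request_id)  # assumed state names
            if state == "accepted":
                available = True
                break
            if state == "rejected":
                break
            # Any other state (e.g. "pending") means: poll again.
    finally:
        # Always cancel, so no instance is left running and billed.
        client.cancel_spot_request(request_id)
    return ProbeResult(instance_type, zone, available)


class FakeClient:
    """In-memory stand-in used only to exercise the sketch."""

    def __init__(self, zones_with_capacity):
        self.zones_with_capacity = set(zones_with_capacity)
        self.requests = {}
        self.cancelled = []

    def submit_spot_request(self, instance_type, zone):
        request_id = f"req-{len(self.requests)}"
        self.requests[request_id] = zone
        return request_id

    def request_state(self, request_id):
        zone = self.requests[request_id]
        return "accepted" if zone in self.zones_with_capacity else "rejected"

    def cancel_spot_request(self, request_id):
        self.cancelled.append(request_id)


client = FakeClient({"zone-a"})
print(probe(client, "m5.large", "zone-a").available)
print(probe(client, "m5.large", "zone-b").available)
```

The `finally` clause is the important design choice here: the request is cancelled on every path, including errors while polling, which is what keeps the cost of a probe close to zero.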
The researchers validated SnS in a large real-world experiment, submitting 336,033 spot requests across 68 instance types and 15 regions on both AWS and Azure. Their analysis of 2,635 actual interruption events revealed a key pattern: over 92% of co-interruptions (multiple instances of the same type being interrupted in the same zone) happen within a tight three-minute window. The researchers cite this finding as justification for their binary availability model. From the SnS signals, they derived three complementary predictive features. Combined, these features achieved an F1-macro score of 0.90 for modeling current availability and 0.85 for predicting availability 60 minutes into the future.
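For readers unfamiliar with the metric, F1-macro is the unweighted mean of the per-class F1 scores, so the 'available' and 'unavailable' classes count equally even when one is much rarer. The sketch below computes it from scratch; the labels are made-up toy data, not results from the study.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: 1 = available, 0 = unavailable.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(f1_macro(y_true, y_pred))  # 0.625 (class 1: 0.75, class 0: 0.5)
```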
Finally, the team demonstrated the practical value of SnS through trace-driven simulation using TPC-DS, a standard benchmark for decision support systems. The simulation showed that using SnS-based predictions to guide workload placement significantly reduces lost computation compared with an unguided baseline. The result suggests the method could help engineers and data scientists run large-scale, cost-sensitive batch jobs, AI training, or data processing workloads more reliably on spot infrastructure, keeping the savings while reducing disruptive failures.
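To make "lost computation" concrete, here is a toy trace-driven comparison. It is only a sketch under strong assumptions and does not reproduce the study's simulator or use TPC-DS: the availability traces are invented, jobs have no checkpoints (an interruption loses all work done so far), and the "guided" policy is handed a hypothetical per-zone prediction rather than a trained model.

```python
def lost_minutes(trace, start, duration):
    """Work lost by a checkpoint-free job that runs `duration` minutes from `start`.

    `trace[i]` is True when the zone has capacity in minute i. The job is
    interrupted at the first unavailable minute and loses all prior work.
    """
    for elapsed in range(duration):
        if not trace[start + elapsed]:
            return elapsed
    return 0


def place_unguided(zones):
    """Baseline: always use the first zone in the list."""
    return zones[0]


def place_guided(zones, predicted_ok):
    """Use the first zone predicted to stay available, else fall back."""
    for zone in zones:
        if predicted_ok[zone]:
            return zone
    return zones[0]


# Invented per-minute availability traces (True = capacity available).
traces = {
    "zone-a": [True, True, False, True, True, True],
    "zone-b": [True, True, True, True, True, True],
}
# Hypothetical predictions for the job's time window.
predicted_ok = {"zone-a": False, "zone-b": True}

zones = list(traces)
baseline_zone = place_unguided(zones)
guided_zone = place_guided(zones, predicted_ok)
print(baseline_zone, lost_minutes(traces[baseline_zone], start=0, duration=4))
print(guided_zone, lost_minutes(traces[guided_zone], start=0, duration=4))
```

In this toy setup the baseline job in `zone-a` loses two minutes of work, while the guided job in `zone-b` loses none.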
- Probes AWS and Azure spot instance availability at near-zero cost by canceling requests immediately after acceptance; validated with 336,033 requests across 68 instance types and 15 regions.
- Achieves an F1-macro score of 0.90 for current availability and 0.85 for 60-minute forecasts, informed by an analysis of 2,635 real interruption events.
- Trace-driven simulation with TPC-DS workloads shows the method can significantly reduce lost computation compared to unguided baseline strategies.
Why It Matters
Makes deeply discounted cloud compute more dependable for AI training and big data jobs by helping users avoid costly interruptions while keeping the savings.