Research & Papers

Uber's Failover Architecture: Reconciling Reliability and Efficiency in Hyperscale Microservice Infrastructure

Uber's new failover system ditches the costly 2x capacity model, saving over a million CPU cores while maintaining 99.97% availability.

Deep Dive

Uber's engineering team, led by Mayank Bansal and Milind Chabbi, has published a paper detailing their new Uber Failover Architecture (UFA), a system that fundamentally rethinks how the ride-sharing giant ensures platform resilience at global scale. Historically, Uber ensured reliability through a costly '2x capacity' model: every service was provisioned in two separate regions, each sized to absorb global traffic on its own. This left half of the massive compute fleet idle as a standby buffer, resulting in low utilization and high infrastructure costs. UFA replaces this one-size-fits-all approach with a differentiated architecture that aligns resource allocation with business criticality.

Critical services, such as dispatch and payments, retain strong failover guarantees with dedicated buffer capacity, while non-critical services opportunistically run on that reserved buffer during normal operations. During rare 'full-peak' failover events, these non-critical services are selectively preempted and then rapidly restored using on-demand cloud capacity, with differentiated Service-Level Agreements (SLAs) governing their recovery. Automated safeguards, including dependency analysis and regression gates, ensure that critical services keep functioning even while non-critical ones are temporarily unavailable. The quantitative impact is substantial: UFA has already hardened over 4,000 unsafe dependencies and eliminated more than one million CPU cores from a baseline of about four million, a major win for both engineering efficiency and the company's bottom line.
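The selective-preemption idea above can be sketched in a few lines. This is an illustrative model only: the `Service` fields, tier names, and SLA-ordered selection policy are assumptions for the sketch, not Uber's actual API or scheduler logic.

```python
from dataclasses import dataclass

@dataclass
class Service:
    name: str
    critical: bool          # critical services keep dedicated buffer capacity
    cores: int              # cores the service consumes in the surviving region
    recovery_sla_min: int   # differentiated SLA: minutes to restore after preemption

def select_preemptions(services, cores_needed):
    """During a full-peak failover, free buffer capacity for critical traffic by
    preempting non-critical services, loosest recovery SLA first, until enough
    cores are reclaimed."""
    candidates = sorted(
        (s for s in services if not s.critical),
        key=lambda s: -s.recovery_sla_min,  # most preemption-tolerant first
    )
    freed, preempted = 0, []
    for s in candidates:
        if freed >= cores_needed:
            break
        preempted.append(s)
        freed += s.cores
    return preempted, freed

# Hypothetical fleet: critical services are never candidates for preemption.
fleet = [
    Service("dispatch", True, 500, 0),
    Service("payments", True, 300, 0),
    Service("batch-analytics", False, 400, 120),
    Service("ml-training", False, 600, 240),
]
preempted, freed = select_preemptions(fleet, cores_needed=700)
```

In this sketch the preemption order is driven purely by the recovery SLA; a real system would also weigh dependency analysis so that nothing a critical service relies on is taken down.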

Key Points
  • Replaces costly 2x capacity model with a tiered system based on service criticality, cutting steady-state provisioning to 1.3x.
  • Eliminated over 1 million CPU cores from a ~4 million core baseline, boosting fleet utilization from ~20% to ~30%.
  • Maintains 99.97% availability for critical services using automated safeguards and selective preemption of non-critical workloads.
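A quick back-of-envelope check shows the headline figures are mutually consistent. The numbers below are the article's approximate values, not data from the paper itself:

```python
baseline = 4_000_000        # ~4M cores under the 2x capacity model
demand = baseline / 2       # 2x model: each region sized for full global traffic
ufa = demand * 1.3          # UFA's steady-state 1.3x provisioning -> 2.6M cores
saved = baseline - ufa      # ~1.4M cores, consistent with ">1 million eliminated"

# For a fixed workload, utilization scales inversely with fleet size:
util_before = 0.20
util_after = util_before * baseline / ufa   # ~0.31, consistent with ~20% -> ~30%
```

So shrinking the fleet from 4M to 2.6M cores both frees over a million cores and lifts utilization by roughly half, matching the reported jump from ~20% to ~30%.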

Why It Matters

Shows how hyperscale companies can achieve massive cost savings without sacrificing reliability, offering a blueprint for efficient cloud infrastructure.