Research & Papers

New mechanisms boost fault resilience in NVIDIA MPS GPU sharing

With two proposed mechanisms, a single GPU fault need not crash every process co-running under NVIDIA MPS.

Deep Dive

NVIDIA Multi-Process Service (MPS) allows multiple processes to share a single GPU simultaneously, improving utilization in data centers. However, MPS has a critical weakness: any GPU fault in one process kills all co-running processes, making it unreliable for multi-tenant or resilience-critical environments. To address this, researchers first conducted a systematic characterization of GPU faults and their end-to-end processing pipeline, identifying the dominant fault types and where they occur.

Based on these insights, they designed two complementary mechanisms for fault-resilient MPS. The first is a fault isolation mechanism for memory-related faults, the most common type. These faults can be handled entirely by software intervention in the open GPU driver kernel module. The second mechanism, fast recovery, targets faults whose processing path lies within proprietary software components, where the open driver cannot intervene. It uses virtual memory-based sharing of GPU-resident state to recover quickly without a full process restart.
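To make the difference in blast radius concrete, here is a toy Python model of the behavior described above. It is only a sketch of the policy, not the researchers' implementation, and it touches no real GPU or driver. The names `SharedGpu`, `Client`, and `FaultKind` are hypothetical. The model also assumes that the faulting process itself is always terminated and that, under fast recovery, the surviving processes resume from shared GPU-resident state; the summary does not spell out either detail.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class FaultKind(Enum):
    MEMORY = auto()  # memory-related fault, handled in the open kernel module
    OTHER = auto()   # fault whose handling path lies in proprietary components


@dataclass
class Client:
    name: str
    alive: bool = True
    fast_recoveries: int = 0  # times this client resumed without a full restart


@dataclass
class SharedGpu:
    resilient: bool  # False models baseline MPS, True models the two mechanisms
    clients: dict = field(default_factory=dict)

    def add(self, name):
        self.clients[name] = Client(name)

    def fault(self, name, kind):
        """Apply one fault raised by `name`; return the bystanders that were killed."""
        faulting = self.clients[name]
        bystanders = [c for c in self.clients.values()
                      if c.alive and c is not faulting]
        faulting.alive = False  # assumption: the faulting client always dies

        if not self.resilient:
            # Baseline MPS: the fault takes down every co-running client.
            for c in bystanders:
                c.alive = False
            return sorted(c.name for c in bystanders)

        if kind is FaultKind.MEMORY:
            # Fault isolation: bystanders are untouched.
            return []

        # Fast recovery (assumed semantics): bystanders are interrupted,
        # then resume from GPU-resident state instead of restarting.
        for c in bystanders:
            c.fast_recoveries += 1
        return []


if __name__ == "__main__":
    for resilient in (False, True):
        gpu = SharedGpu(resilient)
        for n in ("train", "infer", "etl"):
            gpu.add(n)
        killed = gpu.fault("etl", FaultKind.MEMORY)
        print(f"resilient={resilient}: bystanders killed = {killed}")
```

In the baseline run, the fault in `etl` also kills `train` and `infer`. In the resilient run, the list of killed bystanders is empty.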

Evaluated on various GPU architectures and workloads, both mechanisms demonstrate effective fault handling with minimal performance overhead. The approach lets MPS keep GPU utilization high while providing the fault resilience that shared infrastructure needs, which could make GPU sharing safer to deploy in cloud and HPC environments.

Key Points
  • Fault isolation for memory-related GPU faults via open GPU driver kernel module
  • Fast recovery mechanism using virtual memory-based GPU-resident state sharing for faults handled in proprietary software components
  • Evaluation shows effective fault handling with minimal overhead across multiple GPU architectures and workloads

Why It Matters

Improves reliability of GPU sharing, enabling safer multi-tenant clusters and higher utilization without fault contagion.