Guaranteeing Semantic and Performance Determinism in Flexible GPU Sharing
New system solves the GPU-sharing dilemma, guaranteeing identical results and predictable latency with zero code changes.
A team of researchers has introduced DetShare, a GPU-sharing system that changes how data centers can allocate expensive GPU resources. The system addresses a critical trade-off in current approaches: coarse-grained temporal multiplexing causes severe latency spikes for interactive AI services, while fine-grained spatial partitioning typically requires invasive kernel modifications that break behavioral equivalence. DetShare's key innovation is 'GPU coroutines,' a new abstraction that decouples logical execution contexts from the physical GPU hardware. This enables flexible, fine-grained resource allocation through lightweight context migration while remaining fully transparent: existing AI workloads run with zero code modifications.
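To make the coroutine idea concrete, here is a minimal Python sketch of how a runtime might decouple a logical execution context from the physical GPU it runs on. It is illustrative only: the names (GPUSlice, GPUCoroutine, submit, migrate, step) and the slice model are assumptions made for exposition, not DetShare's actual API or implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class GPUSlice:
    """A physical share of one GPU (e.g., a fraction of its SMs). Hypothetical model."""
    device_id: int
    sm_fraction: float


@dataclass
class GPUCoroutine:
    """Logical execution context: queued kernels plus a binding to a slice.

    The workload submits kernels as if it owned a whole GPU; the runtime
    decides which physical slice runs them and may rebind the context
    between kernels without the workload noticing.
    """
    name: str
    pending_kernels: deque = field(default_factory=deque)
    bound_slice: Optional[GPUSlice] = None

    def submit(self, kernel: Callable[[GPUSlice], None]) -> None:
        self.pending_kernels.append(kernel)

    def migrate(self, target: GPUSlice) -> None:
        # Migration happens at a kernel boundary, so only the binding
        # changes -- no in-flight kernel state has to be moved.
        self.bound_slice = target

    def step(self) -> None:
        if self.bound_slice is not None and self.pending_kernels:
            kernel = self.pending_kernels.popleft()
            kernel(self.bound_slice)  # launch on the currently bound slice


# Two logical contexts share one physical GPU at fine granularity.
half_a = GPUSlice(device_id=0, sm_fraction=0.5)
half_b = GPUSlice(device_id=0, sm_fraction=0.5)

train = GPUCoroutine("train-job")
serve = GPUCoroutine("inference-service")
train.migrate(half_a)
serve.migrate(half_b)

train.submit(lambda s: print(f"train kernel on GPU {s.device_id}, {s.sm_fraction:.0%} of SMs"))
serve.submit(lambda s: print(f"decode kernel on GPU {s.device_id}, {s.sm_fraction:.0%} of SMs"))
train.step()
serve.step()
```

The property the sketch tries to capture is that rebinding occurs only between kernels, which is what makes context migration lightweight and invisible to the workload.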
DetShare delivers substantial performance improvements across both training and inference scenarios. In evaluations, the system boosted training throughput by up to 79.2% compared to traditional temporal sharing methods, and for co-located workloads it reduced P99 tail latency by 15.1% without sacrificing throughput. The system's workload-aware placement and TPOT-First scheduling policy proved particularly effective for inference, decreasing average latency by 69.1% and reducing time-per-output-token (TPOT) service-level objective (SLO) violations by 21.2% compared to default policies. This is a significant advance for running mixed AI workloads, such as simultaneous training jobs and real-time inference services, on shared GPU infrastructure.
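The summary names the TPOT-First policy but does not spell out its algorithm. The sketch below shows one plausible reading, stated as an assumption rather than the authors' design: decode steps are ordered by how soon they must emit their next token to stay within the TPOT SLO, and throughput-oriented work (training, prefill) fills the remaining capacity. The class TPOTFirstQueue and the 50 ms budget are hypothetical.

```python
import heapq
import time
from dataclasses import dataclass, field

# Assumed 50 ms time-per-output-token budget; purely illustrative.
TPOT_SLO_S = 0.050


@dataclass(order=True)
class DecodeStep:
    deadline: float                       # latest time the next token should be emitted
    request_id: str = field(compare=False)


class TPOTFirstQueue:
    """Earliest-deadline-first queue over decode steps.

    One plausible 'TPOT-first' policy: the decode step whose token is due
    soonest runs next; training or prefill work is scheduled only with
    whatever GPU capacity is left over.
    """

    def __init__(self):
        self._heap = []

    def on_token_emitted(self, request_id):
        # The next token for this request is due one TPOT budget from now.
        heapq.heappush(self._heap, DecodeStep(time.monotonic() + TPOT_SLO_S, request_id))

    def next_decode(self):
        return heapq.heappop(self._heap) if self._heap else None


queue = TPOTFirstQueue()
queue.on_token_emitted("chat-42")
queue.on_token_emitted("chat-7")
urgent = queue.next_decode()
print(f"run decode for {urgent.request_id} before t={urgent.deadline:.3f}")
```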
- Introduces 'GPU coroutines' abstraction enabling fine-grained sharing without kernel modifications or code changes
- Improves AI training throughput by up to 79.2% versus temporal sharing while reducing P99 tail latency by 15.1% for co-located workloads
- Cuts average inference latency by 69.1% and reduces TPOT SLO violations by 21.2% via workload-aware placement and TPOT-First scheduling
Why It Matters
Enables data centers to run more AI workloads simultaneously on existing GPUs, dramatically improving utilization and reducing costs for training and inference.