GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
A new research system improves SLO attainment by up to 44% when co-serving Stable Diffusion and Sora-style models on shared GPU clusters.
A research team led by Fanjiang Ye, with 12 co-authors from institutions including the University of Washington and the University of Texas at Austin, has introduced GENSERVE, a system for efficiently co-serving heterogeneous diffusion model workloads. The core challenge it addresses is the large gap in computational demand between text-to-image (T2I) models such as Stable Diffusion and text-to-video (T2V) models such as Sora, which causes severe Service Level Objective (SLO) violations in current serving systems. GENSERVE's key insight is that the diffusion process is predictable and proceeds step by step, so each inference step can serve as a natural preemption point for dynamic resource management.
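The step-as-preemption-point idea can be illustrated with a minimal sketch. This is not GENSERVE's actual code: the names (`Request`, `PreemptingScheduler`, `denoise_step`) and the fixed step budget are hypothetical, and the denoising step is a stand-in for a real model call. The point is only that a diffusion job checkpoints cheaply at step boundaries and can yield the GPU between steps.

```python
from dataclasses import dataclass

@dataclass
class Request:
    resume_step: int = 0   # step to resume from after a preemption
    latents: float = 0.0   # stand-in for the latent tensor
    done: bool = False

def denoise_step(latents, step):
    # Placeholder for one denoising step of the diffusion model.
    return latents + 1.0

class PreemptingScheduler:
    """Hypothetical scheduler: preempts a job once it has used its step budget,
    checkpointing at the step boundary so the job can resume later."""
    def __init__(self, step_budget):
        self.step_budget = step_budget

    def run(self, req, num_steps):
        steps_used = 0
        for step in range(req.resume_step, num_steps):
            req.latents = denoise_step(req.latents, step)
            steps_used += 1
            req.resume_step = step + 1  # checkpoint: safe resume point
            if steps_used >= self.step_budget and req.resume_step < num_steps:
                return False            # preempted; GPU is freed mid-job
        req.done = True
        return True
```

With a 50-step job and a 20-step budget, the job runs in three slices, resuming each time from its last completed step rather than restarting.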
GENSERVE implements three coordinated mechanisms for step-level adaptation: intelligent preemption for video generation tasks, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that optimizes allocation across all concurrent requests. This allows the system to dynamically adjust GPU resources between T2I and T2V jobs based on real-time demand and latency requirements. Experimental results demonstrate that GENSERVE achieves up to a 44% improvement in SLO attainment rate compared to the strongest existing baselines, significantly enhancing GPU cluster utilization for AI generation platforms that must handle both modalities.
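To make the scheduling idea concrete, here is a toy greedy allocator, not GENSERVE's actual algorithm: it grants GPUs to the requests with the least deadline slack first, and lets a video request span several GPUs as a stand-in for elastic sequence parallelism. The request fields (`slack_s`, `gpus_wanted`) are illustrative assumptions.

```python
def allocate(requests, total_gpus):
    """Greedy SLO-aware allocation sketch.

    requests: list of dicts with 'id', 'slack_s' (seconds until the SLO
    deadline would be missed), and 'gpus_wanted' (1 for a T2I job, more
    for a parallelized T2V job).
    Returns {request id: GPUs granted this scheduling round}.
    """
    plan, free = {}, total_gpus
    # Tightest deadline first: requests closest to violating their SLO win.
    for req in sorted(requests, key=lambda r: r["slack_s"]):
        grant = min(req["gpus_wanted"], free)
        if grant > 0:
            plan[req["id"]] = grant  # a T2V job may get fewer GPUs than ideal
            free -= grant
    return plan
```

On a 6-GPU node with a tight T2I request and two 4-GPU T2V requests, the T2I job is served first and the less urgent video job is squeezed down to the single remaining GPU, mirroring the elastic degree-of-parallelism adjustment described above.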
- Leverages the predictable, step-wise diffusion process to enable fine-grained, preemptible resource scheduling between image and video generation tasks.
- Introduces three core mechanisms: intelligent video preemption, elastic sequence parallelism, and a joint SLO-aware scheduler for cross-request optimization.
- Improves Service Level Objective (SLO) attainment rate by up to 44% over current systems, drastically reducing latency violations for end-users.
Why It Matters
Enables AI platforms to serve more users concurrently with higher reliability, reducing costs and improving latency for image and video generation services.