GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
A new research system improves SLO attainment by up to 44% when co-serving Stable Diffusion and Sora-style models on shared GPU clusters.
A research team led by Fanjiang Ye, with 12 co-authors from institutions including the University of Washington and the University of Texas at Austin, has introduced GENSERVE, a system for efficiently co-serving heterogeneous diffusion model workloads. The core challenge it addresses is the large gap in computational demand between text-to-image (T2I) models such as Stable Diffusion and text-to-video (T2V) models such as Sora, which causes severe Service Level Objective (SLO) violations in current serving systems. GENSERVE's key insight is that the diffusion process is predictable and proceeds step by step, so each inference step can serve as a natural preemption point for dynamic resource management.
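The step-as-preemption-point idea can be illustrated with a minimal sketch. This is not GENSERVE's actual code: the names (`Request`, `PreemptingScheduler`, `denoise_step`) and the fixed step budget are hypothetical, and the denoising step is a stand-in for a real model call. The point is only that a diffusion job checkpoints cheaply at step boundaries and can yield the GPU between steps.

```python
from dataclasses import dataclass

@dataclass
class Request:
    resume_step: int = 0   # step to resume from after a preemption
    latents: float = 0.0   # stand-in for the latent tensor
    done: bool = False

def denoise_step(latents, step):
    # Placeholder for one denoising step of the diffusion model.
    return latents + 1.0

class PreemptingScheduler:
    """Hypothetical scheduler: preempts a job once it has used its step budget,
    checkpointing at the step boundary so the job can resume later."""
    def __init__(self, step_budget):
        self.step_budget = step_budget

    def run(self, req, num_steps):
        steps_used = 0
        for step in range(req.resume_step, num_steps):
            req.latents = denoise_step(req.latents, step)
            steps_used += 1
            req.resume_step = step + 1  # checkpoint: safe resume point
            if steps_used >= self.step_budget and req.resume_step < num_steps:
                return False            # preempted; GPU is freed mid-job
        req.done = True
        return True
```

With a 50-step job and a 20-step budget, the job runs in three slices, resuming each time from its last completed step rather than restarting.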
GENSERVE implements three coordinated mechanisms for step-level adaptation: intelligent preemption for video generation tasks, elastic sequence parallelism with dynamic batching, and an SLO-aware scheduler that optimizes allocation across all concurrent requests. This allows the system to dynamically adjust GPU resources between T2I and T2V jobs based on real-time demand and latency requirements. Experimental results demonstrate that GENSERVE achieves up to a 44% improvement in SLO attainment rate compared to the strongest existing baselines, significantly enhancing GPU cluster utilization for AI generation platforms that must handle both modalities.
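To make the scheduling idea concrete, here is a toy greedy allocator, not GENSERVE's actual algorithm: it grants GPUs to the requests with the least deadline slack first, and lets a video request span several GPUs as a stand-in for elastic sequence parallelism. The request fields (`slack_s`, `gpus_wanted`) are illustrative assumptions.

```python
def allocate(requests, total_gpus):
    """Greedy SLO-aware allocation sketch.

    requests: list of dicts with 'id', 'slack_s' (seconds until the SLO
    deadline would be missed), and 'gpus_wanted' (1 for a T2I job, more
    for a parallelized T2V job).
    Returns {request id: GPUs granted this scheduling round}.
    """
    plan, free = {}, total_gpus
    # Tightest deadline first: requests closest to violating their SLO win.
    for req in sorted(requests, key=lambda r: r["slack_s"]):
        grant = min(req["gpus_wanted"], free)
        if grant > 0:
            plan[req["id"]] = grant  # a T2V job may get fewer GPUs than ideal
            free -= grant
    return plan
```

On a 6-GPU node with a tight T2I request and two 4-GPU T2V requests, the T2I job is served first and the less urgent video job is squeezed down to the single remaining GPU, mirroring the elastic degree-of-parallelism adjustment described above.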
- Leverages the predictable, step-wise diffusion process to enable fine-grained, preemptible resource scheduling between image and video generation tasks.
- Introduces three core mechanisms: intelligent video preemption, elastic sequence parallelism, and a joint SLO-aware scheduler for cross-request optimization.
- Improves Service Level Objective (SLO) attainment rate by up to 44% over current systems, drastically reducing latency violations for end-users.
Why It Matters
Enables AI platforms to serve more users concurrently with higher reliability, reducing costs and improving latency for image and video generation services.