Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
Dynamic tensor parallelism cuts contention between latency tiers, improving SLO-compliant goodput by up to 5.3x over existing systems.
Large language model (LLM) deployments increasingly serve a mix of latency-sensitive interactive requests and relaxed background workloads on the same GPU cluster. This tiered service-level objective (SLO) environment is challenging because workload mix, request lengths, and load intensity fluctuate over time. Existing serving systems primarily optimize request-level controls like queuing and batching while keeping the underlying execution configuration static. This limits their ability to adapt to multi-tier contention, leading to wasted GPU resources and SLO violations.
Enter Nitsum, a new distributed LLM serving system developed by researchers at the University of California, San Diego. Nitsum reimagines tensor parallelism (TP), the way each layer's weights are split across GPUs, as a first-class, dynamically adjustable runtime parameter. It jointly optimizes the TP degree, the split of GPUs between prefill and decode phases, and request scheduling. To make frequent TP changes practical, Nitsum introduces TP-aware weight reuse to avoid redundant memory loads and a fast KV cache migration mechanism. In experiments with real-world traces and microbenchmarks, Nitsum achieved up to 5.3x higher SLO-compliant goodput than current state-of-the-art systems, delivering substantial gains in GPU utilization and request throughput while still meeting latency targets.
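To make the weight-reuse idea concrete, here is a minimal, hypothetical sketch of what "TP-aware weight reuse" could look like for a single column-partitioned weight matrix: when the TP degree changes, each GPU keeps the columns it already holds and loads only the missing ones. The names (`Slice`, `shard_range`, `plan_reload`) and the assumption that a GPU keeps its old rank after reconfiguration are illustrative choices, not Nitsum's actual API or placement policy.

```python
"""Hypothetical sketch of TP-aware weight reuse for one weight matrix that is
split column-wise across the GPUs in a tensor-parallel group."""

from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    """Half-open column range [start, stop) of one weight matrix."""
    start: int
    stop: int

    def overlap(self, other: "Slice") -> int:
        return max(0, min(self.stop, other.stop) - max(self.start, other.start))


def shard_range(num_cols: int, tp_degree: int, rank: int) -> Slice:
    """Columns owned by `rank` when the weight is split evenly across `tp_degree` GPUs."""
    width = num_cols // tp_degree
    return Slice(rank * width, (rank + 1) * width)


def plan_reload(num_cols: int, old_tp: int, new_tp: int, rank: int) -> dict:
    """Estimate how many columns a GPU must fetch after a TP change.

    Columns already resident from the old shard that also fall in the new
    shard are reused in place; only the remainder needs to be loaded from
    host memory or a peer GPU.  (Assumes the GPU keeps the same rank.)
    """
    old = shard_range(num_cols, old_tp, rank)
    new = shard_range(num_cols, new_tp, rank)
    reused = old.overlap(new)
    return {"needed": new.stop - new.start,
            "reused": reused,
            "to_load": (new.stop - new.start) - reused}


if __name__ == "__main__":
    # Example: shrinking TP from 4 GPUs to 2 on an 8192-column projection.
    for rank in range(2):
        print(rank, plan_reload(8192, old_tp=4, new_tp=2, rank=rank))
    # Rank 0 reuses its resident 2048 columns and loads 2048 more; rank 1's
    # old shard (2048-4096) does not overlap its new shard (4096-8192), so it
    # must load all 4096 columns.
```

The point of the sketch is only that reconfiguration cost depends on how much of the old shard overlaps the new one, which is what a TP-aware reuse policy would try to maximize.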
- Nitsum treats tensor parallelism as a dynamic runtime control, unlike existing systems that set TP statically.
- Jointly optimizes TP degree, prefill/decode GPU split, and request scheduling under fluctuating multi-tier workloads (see the configuration-search sketch after this list).
- Achieves up to 5.3x improvement in SLO-compliant goodput over state-of-the-art systems, with TP-aware weight reuse and fast KV cache migration making frequent reconfiguration practical.
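The joint optimization can be pictured as a small configuration search: enumerate feasible (TP degree, prefill/decode GPU split) pairs and keep the one with the highest estimated SLO-compliant goodput for the current workload mix. The sketch below assumes a toy cost model (`estimate_goodput`) and hypothetical parameter names; it is not Nitsum's actual performance model or search procedure.

```python
"""Hypothetical sketch of a joint (TP degree, prefill/decode split) search,
re-run whenever the observed workload mix shifts."""


def estimate_goodput(tp, prefill_gpus, decode_gpus, interactive_rps, batch_rps):
    """Toy placeholder model: throughput scales sub-linearly with TP degree,
    and configurations that cannot keep prefill latency in check earn no
    goodput.  Real systems would use profiled or analytical cost models."""
    decode_tput = decode_gpus * (tp ** 0.8) * 10.0       # arbitrary units
    prefill_tput = prefill_gpus * (tp ** 0.8) * 25.0
    meets_ttft_slo = prefill_tput >= 1.5 * interactive_rps  # crude SLO check
    served = min(decode_tput, interactive_rps + batch_rps)
    return served if meets_ttft_slo else 0.0


def choose_config(total_gpus, interactive_rps, batch_rps):
    """Pick (tp, prefill_gpus, decode_gpus) maximizing estimated goodput."""
    best, best_score = None, -1.0
    for tp in (1, 2, 4, 8):
        if total_gpus % tp:
            continue
        groups = total_gpus // tp                  # TP groups available
        for prefill_groups in range(1, groups):    # at least one group per phase
            decode_groups = groups - prefill_groups
            score = estimate_goodput(tp, prefill_groups * tp, decode_groups * tp,
                                     interactive_rps, batch_rps)
            if score > best_score:
                best = (tp, prefill_groups * tp, decode_groups * tp)
                best_score = score
    return best, best_score


if __name__ == "__main__":
    # Example: 8 GPUs serving 40 interactive and 120 background requests/s.
    print(choose_config(total_gpus=8, interactive_rps=40, batch_rps=120))
```

The interesting part is not the toy numbers but the shape of the decision: because TP degree and the prefill/decode split interact, they are searched together rather than tuned in isolation, and the chosen configuration changes as the tiered load changes.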
Why It Matters
Makes GPU clusters far more efficient for mixed LLM workloads, reducing costs and latency violations.