Research & Papers

TAPER system boosts LLM serving goodput by 1.77×

TAPER dynamically regulates branch parallelism to avoid latency spikes.

Deep Dive

Existing LLM serving systems that expose intra-request parallelism (allowing independent branches of a request to decode concurrently) face a fundamental trade-off. Eagerly admitting every branch inflates the shared decode step, which slows co-batched requests that are in serial stages. Conservative fixed caps, conversely, forfeit the throughput gains that motivated branching in the first place. The 'branch externality' (the excess step latency caused by admitted branches) depends on batch composition, context lengths, and accumulated slack, all of which change continuously over a workload trace.

TAPER, introduced by researchers including William J. Dally and Christos Kozyrakis, addresses this with per-step admission control. It treats extra branches as opportunistic work and admits them only when the predicted branch externality fits within the batch's current slack budget. Per-step regulation is practical because branch-level scheduling decouples compute from memory: branches share the request's prefix KV cache, so expanding or contracting width requires no memory reclamation. On Qwen3-32B, TAPER improves goodput by 1.77× over IRP-Off and 1.48× over IRP-Eager while maintaining over 95% SLO attainment.
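
The admission rule described above can be sketched in a few lines. This is a minimal illustration, not TAPER's actual implementation: the linear cost model, its coefficients, and the names `predicted_externality_ms` and `admit_branches` are all assumptions made for the example. The summary only states that extra branches are admitted when the predicted externality fits within the current slack.

```python
def predicted_externality_ms(extra_branches: int, context_len: int,
                             ms_per_branch: float = 0.5,
                             ms_per_ktoken: float = 0.1) -> float:
    """Hypothetical linear cost model (an assumption, not from the paper).

    Estimates the excess decode-step latency caused by admitting
    `extra_branches` branches that each attend over `context_len` tokens.
    """
    per_branch = ms_per_branch + ms_per_ktoken * context_len / 1000
    return extra_branches * per_branch


def admit_branches(requested: int, context_len: int, slack_ms: float) -> int:
    """Return how many extra branches to admit for the current decode step.

    Branches are opportunistic work: one more is admitted only while the
    predicted externality of the larger set still fits in the slack budget.
    """
    admitted = 0
    while (admitted < requested and
           predicted_externality_ms(admitted + 1, context_len) <= slack_ms):
        admitted += 1
    return admitted


if __name__ == "__main__":
    # Slack changes from step to step, so the decision is re-made every step.
    for step, slack_ms in enumerate([5.0, 0.0, 2.0]):
        width = admit_branches(requested=8, context_len=4000, slack_ms=slack_ms)
        print(f"step {step}: slack={slack_ms} ms, admitted branches={width}")
```

Because the decision is recomputed at each step, the admitted width can shrink when slack runs out and grow again when it recovers. In this sketch that costs nothing beyond the check itself, which mirrors the point that branches sharing a prefix KV cache need no memory reclamation when width changes.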

Key Points
  • TAPER improves goodput by 1.77× over no branch parallelism (IRP-Off) and 1.48× over eager admission (IRP-Eager) on Qwen3-32B.
  • Maintains over 95% service-level objective (SLO) attainment while dynamically regulating branch parallelism.
  • Uses per-step admission control, treating extra branches as opportunistic work admitted only when their predicted excess step latency fits within the batch's slack budget.

Why It Matters

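Regulating branch parallelism per step lets serving systems keep the throughput gains of branching without pushing co-batched requests past their latency SLOs, which can cut cost and latency for AI applications at scale.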