TENT: A Declarative Slice Spraying Engine for Performant and Resilient Data Movement in Disaggregated LLM Serving
New system from industry researchers dynamically routes data across GPU interconnects, cutting latency and recovering from link failures in under 50 ms.
A team of 19 researchers from major industrial AI labs has published TENT, a new data-movement engine designed to solve a critical bottleneck in large-scale LLM serving. Modern GPU clusters use a complex mix of high-speed interconnects like NVLink and RDMA, but existing frameworks route traffic over rigid, static paths. This creates communication silos, wastes bandwidth through head-of-line blocking, and leaves systems fragile to routine faults. TENT changes this by decoupling an application's transfer intent from its physical execution.
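As a concrete illustration of that decoupling, here is a minimal sketch in Python of what a declarative transfer intent could look like. Everything here is hypothetical (`TransferIntent`, its fields, and the example endpoints are not TENT's published API); the point is that the application states only what must move, never over which link.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransferIntent:
    """Everything the application declares: payload and endpoints.

    Deliberately absent: link choice, path, protocol. The engine
    resolves those against the pooled interconnects at runtime.
    """
    src_node: str   # logical source, e.g. a prefill worker
    dst_node: str   # logical destination, e.g. a decode worker
    num_bytes: int  # size of the buffer to move
    tag: str = ""   # optional label for the application's bookkeeping

# Hypothetical usage: move a 512 MiB KV-cache segment between workers.
intent = TransferIntent("prefill-0", "decode-3", 512 * 2**20, tag="kvcache")
```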
Instead of locking workloads to fixed paths, TENT unifies all available network links into a single dynamic pool. Applications simply declare what data needs to move, and TENT's core innovation, "slice spraying," takes over. It dynamically breaks large data flows into fine-grained slices and sprays them across all links based on instantaneous quality and congestion signals. This telemetry-driven approach eliminates head-of-line blocking and enables transparent, sub-50 millisecond self-healing by rerouting slices around failures without any application logic changes. The system is already in production use for LLM inference and reinforcement learning pipelines.
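To make the spraying idea concrete, the following is a minimal sketch assuming a simple estimated-completion-time heuristic over per-link telemetry. `Link`, `spray`, and the `eta` scoring are illustrative stand-ins, not the paper's scheduler, and a real engine would fold in richer signals (congestion marks, error counters) than queue depth and a bandwidth estimate.

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    gbps: float            # bandwidth estimate from telemetry (assumed signal)
    queued_bytes: int = 0  # bytes assigned to this link but not yet drained
    healthy: bool = True

    def eta(self, slice_bytes: int) -> float:
        """Seconds to finish this slice if it were enqueued here now."""
        return (self.queued_bytes + slice_bytes) / (self.gbps * 1e9 / 8)

def spray(total_bytes: int, slice_bytes: int, links: list[Link]) -> list[tuple[str, int]]:
    """Cut one flow into slices; assign each to the instantaneously best link."""
    plan, remaining = [], total_bytes
    while remaining > 0:
        n = min(slice_bytes, remaining)
        live = [l for l in links if l.healthy]
        best = min(live, key=lambda l: l.eta(n))  # congestion-aware pick
        best.queued_bytes += n                    # feedback: this link is now busier
        plan.append((best.name, n))
        remaining -= n
    return plan

# A 64 MiB flow in 1 MiB slices, spread over one NVLink and two RDMA NICs.
links = [Link("nvlink0", 400.0), Link("rdma0", 100.0), Link("rdma1", 100.0)]
plan = spray(64 * 2**20, 1 * 2**20, links)
```

Because every slice is scheduled independently, a slow or congested link simply attracts fewer slices; no flow-sized transfer ever waits behind it, which is what removes head-of-line blocking.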
Evaluations on NVIDIA H800 HGX clusters show significant performance gains. For LLM inference using the SGLang HiCache runtime, TENT delivered up to 1.36x higher throughput and a 26% reduction in 90th-percentile time-to-first-token (TTFT) latency compared to the state-of-the-art Mooncake TE engine. In RL training pipelines, it accelerated parameter updates in the Moonshot Checkpoint Engine by 20-26%. The work demonstrates that intelligent, dynamic data orchestration is a key lever for improving the efficiency and resilience of the massive GPU clusters powering modern AI.
- Dynamically sprays data slices across all GPU interconnects (NVLink, RDMA) using real-time telemetry, eliminating static routing bottlenecks.
- Enables sub-50ms self-healing by automatically rerouting data around failed links without application intervention (see the failover sketch after this list).
- Achieved 1.36x higher throughput and 26% lower P90 TTFT for LLM inference vs. Mooncake TE in H800 cluster tests.
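Continuing the spraying sketch above (same hypothetical `Link` type and `eta` heuristic), the self-healing path can be as small as the following: when a health probe fails, mark the link dead and push its stranded slices back through the same scheduler. This is an assumed mechanism consistent with the article's description, not TENT's actual failover code.

```python
def on_link_failure(failed: Link, in_flight: list[tuple[str, int]],
                    links: list[Link]) -> list[tuple[str, int]]:
    """Reroute slices queued on a failed link; the application never sees it."""
    failed.healthy = False
    kept = [(name, n) for name, n in in_flight if name != failed.name]
    for name, n in in_flight:
        if name == failed.name:                       # stranded slice
            live = [l for l in links if l.healthy]
            best = min(live, key=lambda l: l.eta(n))  # reuse the ETA scoring
            best.queued_bytes += n
            kept.append((best.name, n))
    return kept
```

In this model, recovery latency is bounded by the probe interval plus one rescheduling pass, which suggests how a sub-50 ms target can be met without any application involvement.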
Why It Matters
Directly improves the efficiency and cost of running massive AI models by making better use of expensive GPU cluster hardware.