Open Source

Zai's ZCube network cuts costs 33% and boosts GLM-5.1 inference throughput 15%

New network architecture slashes costs by a third while speeding up AI inference.

Deep Dive

Zai, partnering with Tsinghua University and HarnetsAI, has redesigned the network architecture powering a thousand-GPU cluster running GLM-5.1 coding inference. The team replaced the standard ROFT topology with a new design called ZCube, built for the traffic patterns of Prefill-Decode (PD) disaggregated inference. In PD inference, KV Cache transfers create highly asymmetric traffic between nodes. ROFT's static rail mapping cannot spread that traffic, so it produces hotspots and PFC (Priority Flow Control) backpressure. ZCube addresses this by fully flattening the network: it removes the Spine layer and uses a complete bipartite interconnect between two switch groups. This eliminates a whole category of congestion that ROFT cannot avoid by design.
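To make "complete bipartite interconnect between two switch groups" concrete, here is a minimal Python sketch. It is an illustration only, not Zai's implementation: the group sizes, switch names, and helper functions are hypothetical, and GPU attachment and routing policy are not modeled.

```python
from itertools import product


def complete_bipartite_links(group_a, group_b):
    """Link every switch in group A to every switch in group B (no Spine tier)."""
    return list(product(group_a, group_b))


def switch_hops(src, dst, group_a):
    """Switch-to-switch hop count in the flattened topology."""
    if src == dst:
        return 0
    same_group = (src in group_a) == (dst in group_a)
    # Switches in different groups share a direct link; switches in the
    # same group relay through any one switch of the opposite group.
    return 2 if same_group else 1


def relay_choices(src, dst, group_a, group_b):
    """Number of equal-length relay switches between two same-group switches."""
    if src == dst or (src in group_a) != (dst in group_a):
        return 0
    return len(group_b) if src in group_a else len(group_a)


# Hypothetical sizes, chosen only to keep the example small.
group_a = [f"A{i}" for i in range(4)]
group_b = [f"B{i}" for i in range(4)]
links = complete_bipartite_links(group_a, group_b)

print(len(links))                                     # 16 links for 4 x 4 switches
print(switch_hops("A0", "B3", group_a))               # 1
print(switch_hops("A0", "A2", group_a))               # 2
print(relay_choices("A0", "A2", group_a, group_b))    # 4
```

In this toy model, any two same-group switches have as many equal-length paths as there are switches in the other group, which is one way to picture how a flat design gives asymmetric traffic more room than a fixed rail mapping.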

The production results are striking. Switch and optical module costs dropped 33%, GPU inference throughput increased 15%, and P99 tail latency on first token improved by 40.6%, all with the same GPUs, the same software stack, and the same model. Better network hardware typically costs more, but Zai achieved higher performance at lower cost through a smarter topology. The result shows how infrastructure-level changes can yield significant gains without upgrading compute hardware, a critical advantage for scaling AI inference economically.
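The two headline figures compound. As a rough back-of-the-envelope check in Python (a derived number, not one reported by Zai, and covering only switch and optical module spend rather than total cluster cost):

```python
# Reported figures from the production deployment.
cost_reduction = 0.33      # switch and optical module cost drop
throughput_gain = 0.15     # GPU inference throughput increase

# Network hardware cost per unit of throughput, relative to the old topology.
relative_cost_per_throughput = (1 - cost_reduction) / (1 + throughput_gain)
saving = 1 - relative_cost_per_throughput

print(f"{relative_cost_per_throughput:.3f}")  # 0.583
print(f"{saving:.1%}")                        # 41.7%
```

Under these assumptions, network hardware cost per unit of inference throughput falls by roughly 42%.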

Key Points
  • ZCube reduces switch and optical module costs by 33% while improving GPU inference throughput by 15%.
  • P99 tail latency on first token dropped 40.6% due to the flattened bipartite interconnect.
  • The architecture eliminates Spine-layer congestion that plagues ROFT during PD-disaggregated inference.

Why It Matters

Zai shows that smarter network design can cut costs and boost AI inference performance at the same time.