OpenAI and Zhipu AI release new network architectures to boost AI inference efficiency

BigGo Finance May 21, 2026

⚡ New ZCube architecture cuts P99 time-to-first-token latency by about 40% and switch costs by roughly a third in production clusters.

Deep Dive

On May 5, OpenAI, together with Nvidia, AMD, Intel, Microsoft, and Broadcom, released the MRC network transmission protocol, which targets inefficient data transfer between GPUs in large-scale AI clusters. Just two weeks later, Chinese AI unicorn Zhipu AI, together with Tsinghua University and Yusur Network, announced the large-scale deployment of the ZCube network architecture in its GLM-5.1 online production inference cluster. By reconfiguring the physical topology, ZCube delivered a 15% increase in GPU inference throughput at thousand-GPU scale, a 40.6% reduction in P99 time-to-first-token latency, and roughly one-third lower costs for switches and optical transceivers. Together, the two announcements suggest that the China-U.S. AI infrastructure race is shifting from GPU counts to network efficiency, a shift that matters especially for China, which faces restrictions on high-end chips.

Traditional Clos network architectures struggle with the asymmetric traffic patterns of Prefill-Decode disaggregated inference, leaving GPUs idle while they wait for data. In controlled experiments, raising bandwidth from 100 to 200 Gbps alone increased throughput by ~19% and cut latency by ~22%, which points to the network as a decisive bottleneck. ZCube flattens the pyramid-shaped topology, reducing the number of hops and eliminating the hotspot congestion caused by KV Cache transfers. For Chinese AI companies limited by chip supply, extracting performance gains from existing infrastructure through network innovation becomes a strategic lever, letting them compete without relying on a specific GPU ecosystem.
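To give a feel for why link bandwidth matters in Prefill-Decode disaggregation, here is a back-of-envelope sketch in Python. It is not taken from Zhipu AI's experiments: the 2 GiB KV Cache size, the function name kv_transfer_ms, and the optional per-hop delay are all illustrative assumptions, and the model ignores congestion, protocol overhead, and parallel links.

```python
def kv_transfer_ms(kv_bytes: float, link_gbps: float,
                   hops: int = 1, per_hop_us: float = 0.0) -> float:
    """Idealized time (ms) to move a KV Cache over a single link.

    Serialization time is size / bandwidth; an optional fixed delay is
    added per switch hop. Both inputs are hypothetical, not measured.
    """
    serialization_s = kv_bytes * 8 / (link_gbps * 1e9)  # bits / (bits per second)
    switching_s = hops * per_hop_us * 1e-6
    return (serialization_s + switching_s) * 1e3


# Assumed KV Cache size for one long prompt (illustrative only).
KV_BYTES = 2 * 1024**3

for gbps in (100, 200):
    print(f"{gbps} Gbps: {kv_transfer_ms(KV_BYTES, gbps):.1f} ms")
```

In this idealized model, doubling the bandwidth halves the transfer time. The reported end-to-end gains (~19% throughput, ~22% latency) are smaller, presumably because the transfer is only one part of serving a request.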

Key Points

Zhipu AI deployed ZCube in GLM-5.1 clusters, boosting inference throughput by 15% and cutting P99 time-to-first-token latency by 40.6%.
ZCube reconfigures physical topology to flatten pyramid-like Clos networks, reducing switch and optical transceiver costs by roughly one-third.
OpenAI's MRC protocol (with Nvidia, AMD, Intel, Microsoft, Broadcom) also targets GPU-to-GPU data transfer inefficiencies.

Why It Matters

With chip supply constrained, network efficiency becomes the new competitive edge in AI infrastructure.