OpenAI and Zhipu AI release new network architectures to boost AI inference efficiency
Zhipu AI's new ZCube architecture cuts P99 time-to-first-token latency by about 40% and switch and optical transceiver costs by roughly a third in a production inference cluster.
On May 5, OpenAI joined forces with Nvidia, AMD, Intel, Microsoft, and Broadcom to release the MRC network transmission protocol, which targets inefficient data transfer between GPUs in large-scale AI clusters. Just two weeks later, Chinese AI unicorn Zhipu AI, together with Tsinghua University and Yusur Network, announced the large-scale deployment of the ZCube network architecture in its GLM-5.1 online production inference cluster. ZCube reconfigures the physical topology and, at thousand-GPU scale, delivers a 15% boost in GPU inference throughput, a 40.6% reduction in P99 time-to-first-token latency, and roughly one-third cost savings on switches and optical transceivers. Together, the two announcements signal a shift in the China-U.S. AI infrastructure race from competing on GPU count to competing on network efficiency, which matters especially for China as it faces restrictions on high-end chips.
Traditional Clos network architectures struggle with the asymmetric traffic patterns of Prefill-Decode disaggregated inference, leaving GPUs idle while they wait for data. In controlled experiments, increasing bandwidth from 100 to 200 Gbps alone raised throughput by ~19% and cut latency by ~22%, showing that the network is a decisive bottleneck. ZCube flattens the pyramid-shaped topology, reducing the number of hops and eliminating the hotspot congestion caused by KV Cache transfers. For Chinese AI companies limited by chip supply, extracting performance gains from existing infrastructure through network innovation becomes a strategic lever, allowing them to compete without relying on a specific GPU ecosystem.
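To give a rough sense of why link bandwidth matters for KV Cache transfers, the sketch below computes the ideal time to move a payload across a single link at 100 and 200 Gbps. It is a back-of-the-envelope illustration only: the 1 GB payload size and the function name `transfer_time_ms` are assumptions made for this example, not figures or code from Zhipu AI, and the calculation ignores protocol overhead, queuing, and hop count.

```python
def transfer_time_ms(payload_bytes: float, link_gbps: float) -> float:
    """Ideal serialization time in milliseconds for one payload on one link.

    Ignores protocol overhead, congestion, and multi-hop forwarding.
    """
    bits = payload_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds * 1e3


# Hypothetical KV Cache payload for one request (assumed, not from the article).
KV_CACHE_BYTES = 1e9  # 1 GB

for gbps in (100, 200):
    ms = transfer_time_ms(KV_CACHE_BYTES, gbps)
    print(f"{gbps} Gbps link: {ms:.1f} ms")
```

Under these assumptions the ideal transfer time falls from about 80 ms to about 40 ms. The reported end-to-end latency gain of ~22% is smaller than this ideal halving, which is plausible because data transfer is only one part of total request latency.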
- Zhipu AI deployed ZCube in its GLM-5.1 production inference cluster, boosting GPU inference throughput by 15% and cutting P99 time-to-first-token latency by 40.6%.
- ZCube reconfigures the physical topology to flatten pyramid-like Clos networks, reducing switch and optical transceiver costs by roughly one-third.
- OpenAI's MRC protocol (with Nvidia, AMD, Intel, Microsoft, Broadcom) also targets GPU-to-GPU data transfer inefficiencies.
Why It Matters
With chip supply constrained, network efficiency becomes the new competitive edge in AI infrastructure.