Research & Papers

MultiWrite: New multicast-style method cuts AI collective comm latency by up to 33%

MultiWrite eliminates redundant data copies, reducing latency by up to 33% on Ascend NPUs.

Deep Dive

Collective communication operations such as AllGather and AlltoAll are critical bottlenecks in large-scale AI training and inference. Traditional unicast-based implementations send duplicate copies of the same data across physical links for multiple receivers, causing network congestion and increased latency. A new paper from researchers including Chao Xu introduces MultiWrite, a transmission semantic that borrows multicast principles while overcoming traditional multicast's heavy management overhead and ecosystem compatibility issues.
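
To make the duplication problem concrete, here is a minimal sketch that counts how many times each directed link carries a chunk during an AllGather, once with per-receiver unicast and once with an idealized multicast tree. It is a toy model, not the paper's mechanism: the two-tier topology (nodes under leaf switches joined by a single spine), the function names, and the parameters are all assumptions made for illustration, and it counts transmissions only, so it does not reproduce the reported latency figures.

```python
from collections import Counter


def path(src, dst, rack_size):
    """Directed links a chunk crosses from src to dst in a two-tier tree.

    Hypothetical topology: nodes are grouped into racks of `rack_size`,
    each rack has one leaf switch, and all leaves attach to one spine.
    """
    src_rack, dst_rack = src // rack_size, dst // rack_size
    links = [("node", src, "up")]
    if src_rack != dst_rack:
        # Cross-rack traffic goes up through the source leaf's uplink
        # and down through the destination leaf's downlink.
        links += [("leaf", src_rack, "up"), ("leaf", dst_rack, "down")]
    links.append(("node", dst, "down"))
    return links


def allgather_link_load(num_nodes, rack_size, multicast):
    """Count chunk transmissions per directed link for one AllGather."""
    load = Counter()
    for src in range(num_nodes):
        receivers = [d for d in range(num_nodes) if d != src]
        if multicast:
            # Idealized multicast: the chunk crosses each link of the
            # distribution tree exactly once and is replicated at switches.
            tree = set()
            for dst in receivers:
                tree.update(path(src, dst, rack_size))
            load.update(tree)
        else:
            # Unicast: one full copy per receiver, so shared links
            # carry the same chunk repeatedly.
            for dst in receivers:
                load.update(path(src, dst, rack_size))
    return load


if __name__ == "__main__":
    for mode in (False, True):
        load = allgather_link_load(num_nodes=8, rack_size=4, multicast=mode)
        label = "multicast" if mode else "unicast"
        print(label, "total:", sum(load.values()), "busiest link:", max(load.values()))
```

In this model, the receive side is unchanged (every node still has to take in one chunk from each peer), while the sender uplinks and the inter-switch links stop carrying repeated copies. Those shared links are where the congestion described above builds up.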

MultiWrite is implemented on Ascend NPUs and tested under long-term stress conditions. Results show up to 33% latency reduction compared to unicast-based operators, directly accelerating many-to-many communication patterns essential for distributed parallelization. By eliminating redundant packet transmissions, MultiWrite improves network utilization and end-to-end performance, offering a practical path to faster large model training and inference without requiring major hardware changes.

Key Points
  • MultiWrite eliminates redundant packet copies in collective communication operations such as AllGather and AlltoAll.
  • Achieves up to 33% latency reduction on commercially deployed Ascend NPUs under long-term stress tests.
  • Overcomes traditional multicast limitations: heavy management plane overhead and ecosystem compatibility issues.

Why It Matters

Faster collective communication directly accelerates large model training and inference, reducing costs and time-to-deployment.