Eliminating Hidden Serialization in Multi-Node Megakernel Communication
Megakernels hit a 10x regression on 8 nodes—Perseus fixes the hidden serialization.
Modern Mixture-of-Experts (MoE) models scale by routing tokens to different experts across GPUs. Recent megakernel designs fuse expert computation with GPU-initiated communication into a single persistent kernel, overlapping data transfer with compute at tile granularity. This works well on a single node, but when experts span multiple nodes connected by RDMA, performance regresses by up to 10x on 8 nodes—and the regression worsens with node count.
The root cause is hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline. As the number of concurrent transfers grows, the cost of these fences inflates network latency, exposing communication on the critical path whenever per-expert compute is too small to mask it. The researchers propose Perseus, which applies two techniques: decoupled signaling, which batches fences at per-destination granularity and reduces fence counts by 8x, and NIC-side ordering, which replaces proxy stalls with hardware fence flags so the proxy thread never blocks. On proxy-based transports, Perseus achieves up to 10.3x end-to-end speedup. It also matches or exceeds GPU-direct transport (IBGDA) by up to 1.2x, showing that serialization, not transport choice, limits multi-node megakernel performance.
- MoE megakernels regress up to 10x on 8-node clusters due to proxy-based RDMA fence overhead.
- Perseus reduces fence count by 8x via decoupled signaling and eliminates proxy stalls with NIC-side ordering.
- Achieves up to 10.3x end-to-end speedup on proxy-based transports and matches or exceeds GPU-direct (IBGDA) performance by up to 1.2x.
Why It Matters
Perseus unlocks efficient multi-node MoE inference, critical for scaling large language models across clusters.