Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
Eliminating temporary relay and reordering buffers in MoE dispatch/combine improves TTFT while keeping TPOT competitive on Ascend NPUs.
A team of researchers (Tianlun Hu et al.) from Huawei and collaborating institutions has published a paper on arXiv detailing a new communication design for Mixture-of-Experts (MoE) inference on Ascend hardware. MoE models require massive token exchange across devices during the dispatch and combine phases, which traditionally rely on buffer-centric approaches: explicit inter-process relay and reordering buffers placed around collective transfers. This buffering overhead becomes a major bottleneck in both the prefill and decode stages. The proposed design eliminates these intermediate buffers entirely by exploiting Ascend's globally pooled high-bandwidth memory (HBM) and symmetric-memory allocation. Instead of staging data in relay buffers, the system writes tokens directly into destination expert windows during dispatch and reads results directly from remote expert windows during combine, retaining only lightweight control state (counts, offsets, and synchronization metadata).
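The summary does not give the API, but the core mechanism, direct placement into symmetric per-expert windows with only counts and offsets as control state, can be sketched. In the following Python simulation, `dispatch_direct`, `combine_direct`, and the window/count arrays are hypothetical stand-ins, with NumPy arrays modeling pooled, globally addressable HBM; this is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

NUM_EXPERTS = 8   # experts spread across the ranks sharing pooled HBM
CAPACITY = 16     # max tokens one expert window can hold
HIDDEN = 6        # token hidden size (tiny, for illustration only)

# Symmetric allocation: every rank owns identically shaped per-expert
# windows, so a writer can derive the remote address from (expert, slot)
# alone. These NumPy arrays stand in for pooled, globally addressable HBM.
windows = np.zeros((NUM_EXPERTS, CAPACITY, HIDDEN), dtype=np.float32)
counts = np.zeros(NUM_EXPERTS, dtype=np.int64)  # lightweight control state

def dispatch_direct(tokens, expert_ids):
    """Write each token straight into its destination expert window and
    return the slot it landed in; no relay or reordering buffer exists."""
    slots = np.empty(len(tokens), dtype=np.int64)
    for i, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        slot = counts[eid]        # would be an atomic fetch-and-add on HW
        counts[eid] += 1
        windows[eid, slot] = tok  # direct placement into the remote window
        slots[i] = slot
    return slots

def combine_direct(expert_ids, slots):
    """Read expert outputs straight out of the remote windows in original
    token order, again without any intermediate staging buffer."""
    return windows[expert_ids, slots]

# Round-trip check: five tokens routed to arbitrary experts come back intact.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, HIDDEN)).astype(np.float32)
eids = rng.integers(0, NUM_EXPERTS, size=5)
slot_ids = dispatch_direct(tokens, eids)
assert np.allclose(combine_direct(eids, slot_ids), tokens)
```

The property the design exploits is that the writer can compute the destination address locally from the symmetric layout, so the only coordination left is slot assignment, modeled here by the `counts` array.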
The researchers instantiate the design as two schedules: a prefill schedule optimized for throughput, which maintains richer planning state, and a compact decode schedule for latency-sensitive execution. Experiments on Ascend-based MoE workloads demonstrate reduced dispatch and combine latency in both settings. At the serving level, the implementation improves time to first token (TTFT), preserves competitive time per output token (TPOT), and enlarges the feasible scheduling space under practical latency constraints. The broader takeaway is that on platforms with globally addressable device memory, reducing intermediate buffering around expert execution is a highly effective way to accelerate MoE inference, a key insight for scaling large language models on Ascend hardware.
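The summary leaves the schedule internals unspecified, but the throughput/latency split can be illustrated. In the hypothetical sketch below, a prefill-style path precomputes per-expert counts and prefix-sum offsets (the richer planning state) so a large batch can be placed contiguously, while a decode-style path skips planning and hands out slots on the fly for a few tokens; `plan_prefill` and `assign_decode` are assumptions about how such schedules typically differ, not the paper's code.

```python
import numpy as np

def plan_prefill(expert_ids, num_experts):
    """Throughput-oriented planning: per-expert counts plus a prefix sum
    give every expert a base offset, so each token's destination slot is
    fully known before any data movement starts."""
    counts = np.bincount(expert_ids, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))
    order = np.argsort(expert_ids, kind="stable")  # group tokens by expert
    slots = np.empty(len(expert_ids), dtype=np.int64)
    slots[order] = np.arange(len(expert_ids)) - offsets[expert_ids[order]]
    return counts, offsets, slots

def assign_decode(expert_ids, live_counts):
    """Latency-oriented path: no global plan, minimal metadata; each token
    grabs the next free slot (an atomic fetch-and-add on real hardware)."""
    slots = np.empty(len(expert_ids), dtype=np.int64)
    for i, eid in enumerate(expert_ids):
        slots[i] = live_counts[eid]
        live_counts[eid] += 1
    return slots

# Prefill: a large routed batch gets contiguous, precomputed placement.
batch = np.array([2, 0, 2, 1, 0, 2])
counts, offsets, slots = plan_prefill(batch, num_experts=3)
# counts -> [2 1 3], offsets -> [0 2 3], slots -> [0 0 1 0 1 2]

# Decode: one or two tokens per step, assigned with no planning pass.
live = counts.copy()                                # current occupancy
step_slots = assign_decode(np.array([1, 2]), live)  # -> [1, 3]
```

Separating the two paths mirrors the serving split described above: prefill can amortize a planning pass over a large batch, while decode cannot afford one on its latency-critical path.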
- Eliminates relay and reordering buffers in MoE dispatch/combine using Ascend's pooled HBM and symmetric-memory allocation.
- Provides two schedules, a throughput-oriented prefill schedule and a latency-oriented decode schedule, both of which reduce communication latency.
- Improves time-to-first-token (TTFT) while maintaining competitive time-per-output-token (TPOT) on real workloads.
Why It Matters
Enables faster, more efficient MoE inference on Ascend NPUs, critical for scaling large language models in production.