AcOrch system accelerates GNN training 2.31x on Ascend NPU

arXiv cs.DC June 02, 2026

⚡ New pipeline orchestrates CPU and NPU compute units for a 2.31x average training speedup.

Deep Dive

Graph Neural Networks (GNNs) are critical for applications such as social network analysis and drug discovery, but training them on large graphs is resource-intensive. Sampling-based mini-batch training reduces the cost, yet it still requires careful coordination of three phases with distinct compute demands: subgraph sampling, feature gathering, and model training. To address this, researchers have developed AcOrch, a training system designed specifically for CPU-NPU heterogeneous platforms such as Huawei's Ascend AI processors. AcOrch introduces fine-grained task orchestration with a two-level pipelined execution model. The first level overlaps CPU-based sampling with NPU-based gathering and training. The second level parallelizes operations across the NPU's two kinds of compute units: AI Cube (AIC) for matrix multiplication and AI Vector (AIV) for vector processing. Together, the two levels keep CPUs and NPUs busy at the same time and reduce idle time.
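To make the two-level idea concrete, here is a minimal sketch in plain Python. It is not AcOrch's implementation: the functions `sample`, `gather`, and `train` are hypothetical stand-ins, and ordinary threads with bounded queues take the place of real CPU workers and NPU compute units. It only illustrates the structure, in which each phase hands its output to the next so that different mini-batches can be in different phases at once.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the batch stream


def sample(batch_id):
    """Hypothetical stand-in for CPU subgraph sampling: returns fake node ids."""
    return {"batch": batch_id, "nodes": [batch_id * 4 + i for i in range(4)]}


def gather(sub):
    """Hypothetical stand-in for feature gathering (vector-style work)."""
    return {**sub, "feats": [0.5 * n for n in sub["nodes"]]}


def train(sub):
    """Hypothetical stand-in for a training step (matrix-style work)."""
    return sub["batch"], sum(sub["feats"])  # (batch id, fake "loss")


def producer(num_batches, outbox):
    """Level 1 source: sample batches and push them downstream."""
    for b in range(num_batches):
        outbox.put(sample(b))  # blocks while the queue is full (backpressure)
    outbox.put(DONE)


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))


def run_pipeline(num_batches, depth=2):
    sampled = queue.Queue(maxsize=depth)   # level 1 hand-off: sampling -> gathering
    gathered = queue.Queue(maxsize=depth)  # level 2 hand-off: gathering -> training
    results = queue.Queue()                # unbounded, drained after the workers finish
    workers = [
        threading.Thread(target=producer, args=(num_batches, sampled)),
        threading.Thread(target=stage, args=(gather, sampled, gathered)),
        threading.Thread(target=stage, args=(train, gathered, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    out = []
    while True:
        item = results.get()
        if item is DONE:
            break
        out.append(item)
    return out


if __name__ == "__main__":
    print(run_pipeline(3))
```

The bounded queues (`depth`) are an assumption of this sketch: they keep a fast stage from running arbitrarily far ahead of a slow one. Because each stage is a single worker reading a FIFO queue, results come out in batch order.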

In benchmarks on an Ascend 910B NPU, AcOrch delivered an average 2.31x speedup over MindSporeGL, the current state-of-the-art NPU-native graph learning system. The paper, set to appear in Frontiers of Computer Science, details how AcOrch analyzes the heterogeneous compute characteristics of NPUs to map each task to the most appropriate unit. The result is a system that not only speeds up training but also improves energy efficiency by making fuller use of the available hardware. For organizations deploying GNNs at scale, AcOrch offers a practical path to faster iteration without expensive upgrades, simply by better orchestrating existing CPU and NPU resources. The source code and additional details are available on arXiv.
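The unit-mapping idea can be illustrated with a deliberately simplified sketch. The article says only that matrix multiplication suits the AIC and vector processing suits the AIV, and that AcOrch maps tasks dynamically. The static lookup below, including the operation names and the `plan` helper, is a hypothetical illustration and not AcOrch's actual policy.

```python
from enum import Enum


class Unit(Enum):
    AIC = "AI Cube"    # matrix multiplication
    AIV = "AI Vector"  # vector processing


# Hypothetical operation categories, chosen for illustration only.
MATRIX_OPS = {"matmul", "linear"}
VECTOR_OPS = {"gather", "relu", "add", "mean_aggregate"}


def map_op(op):
    """Return the unit type this sketch assigns to a named operation."""
    if op in MATRIX_OPS:
        return Unit.AIC
    if op in VECTOR_OPS:
        return Unit.AIV
    raise ValueError(f"no mapping rule for operation: {op!r}")


def plan(ops):
    """Pair each operation in a task list with its assigned unit."""
    return [(op, map_op(op)) for op in ops]


if __name__ == "__main__":
    for op, unit in plan(["gather", "matmul", "relu"]):
        print(f"{op:>8} -> {unit.value}")
```

A dynamic scheme like the one the article describes would presumably also weigh runtime conditions such as how busy each unit is. This sketch leaves that out.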

Key Points

AcOrch uses a two-level pipelined execution to overlap sampling, gathering, and training across CPUs and NPUs.
On Ascend 910B NPU, it achieves a 2.31x average speedup over MindSporeGL, the current NPU-native GNN system.
Tasks are dynamically mapped to AI Cube (AIC) and AI Vector (AIV) units within the NPU, maximizing resource utilization.

Why It Matters

A 2.31x average speedup on existing NPUs cuts GNN training time to roughly 43% of the MindSporeGL baseline, which means significant cost and time savings for large-scale graph AI workloads.
