Research & Papers

AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

An AI agent that learns from hardware feedback to optimize operators for Huawei's Ascend NPUs, achieving a 1.19x speedup.

Deep Dive

A team from Huawei and Shanghai Jiao Tong University has introduced AscendOptimizer, an AI agent designed to tackle the significant challenge of optimizing software 'operators' for Huawei's Ascend Neural Processing Units (NPUs). Unlike the mature CUDA ecosystem for Nvidia GPUs, the Ascend platform lacks extensive public optimization knowledge, creating a major bottleneck for developers. AscendOptimizer addresses this by turning the optimization process itself into a learning experience for an AI agent, which operates in a closed loop between two critical components.

On the 'host' side, which manages data movement, the agent performs a profiling-in-the-loop evolutionary search. It directly tests different tiling and data-movement configurations on the hardware, using the performance feedback to discover valid and high-performing setups. On the more complex 'kernel' side, where computation is scheduled, AscendOptimizer employs a clever 'rewinding' technique. It takes already-optimized kernels and systematically de-optimizes them to create instructive 'bad-to-good' trajectories. These trajectories are mined for transferable optimization patterns and stored in a retrievable experience bank, which then guides the AI in rewriting other kernels.
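The host-side search described above can be sketched as a simple evolutionary loop in which every candidate tiling configuration is scored by actual profiling rather than a cost model. The sketch below is illustrative only: `profile_on_hardware` is a hypothetical stand-in for running and timing the operator on the NPU, and the tile-size choices are made up for the example.

```python
import random

# Hypothetical stand-in for compiling a tiling configuration, running the
# operator on the NPU, and returning measured latency (lower is better).
# Here a synthetic cost surface over (tile_m, tile_n) is used instead.
def profile_on_hardware(tile_m, tile_n):
    return abs(tile_m - 64) + abs(tile_n - 128) + 1.0

def evolutionary_tiling_search(generations=200, pop_size=8, seed=0):
    rng = random.Random(seed)
    choices = [16, 32, 64, 128, 256]  # example tile-size candidates

    # Start from a random population of tiling configurations.
    pop = [(rng.choice(choices), rng.choice(choices)) for _ in range(pop_size)]

    for _ in range(generations):
        # Mutate a surviving configuration to propose a new candidate.
        child = tuple(rng.choice(choices) for _ in range(2))
        pop.append(child)
        # Profiling-in-the-loop: rank every candidate by measured latency
        # and keep only the fastest configurations.
        pop.sort(key=lambda cfg: profile_on_hardware(*cfg))
        pop = pop[:pop_size]

    return pop[0]  # best configuration found

best = evolutionary_tiling_search()
print(best, profile_on_hardware(*best))
```

Because fitness comes from real hardware measurements, the search only ever keeps configurations that are both valid and fast on the actual device, which is the point of profiling-in-the-loop over static cost modeling.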

The agent alternates between tuning the host program and rewriting the kernel, steadily pushing performance higher. On a benchmark of 127 real-world AscendC operators, AscendOptimizer achieved a 1.19x geometric-mean speedup over open-source baselines, with 49.61% of operators outperforming their manually optimized references. It also beat strong AI-agent and automated-search baselines, demonstrating the advantage of its approach in this specialized domain.
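The alternating host/kernel scheme can be illustrated with a toy fixed-point loop: apply the host-side tuner, then the kernel rewriter, and stop once neither step improves the measured latency. All names and numbers below are invented for illustration; `tune_host` and `rewrite_kernel` stand in for the real profiling-driven search and experience-bank-guided rewriting.

```python
# State is (latency, remaining_budget); both steps are placeholders for
# the real optimizers and simply model diminishing improvements.
def tune_host(state):
    latency, budget = state
    # Stand-in for the host-side tiling/data-movement search.
    return (latency - 2, budget - 1) if budget > 0 and latency > 10 else state

def rewrite_kernel(state):
    latency, budget = state
    # Stand-in for the experience-bank-guided kernel rewrite.
    return (latency - 3, budget - 1) if budget > 0 and latency > 10 else state

def alternate(latency=30, budget=10):
    state = (latency, budget)
    while True:
        new_state = rewrite_kernel(tune_host(state))
        if new_state == state:  # converged: neither side improves further
            return state[0]
        state = new_state

print(alternate())  # -> 10
```

The real system's stopping condition would be driven by profiling results rather than a fixed floor, but the control flow, alternating two specialized optimizers until a fixed point, is the same shape.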

Key Points
  • Uses a two-part AI agent to optimize both host-side tiling/data-movement and kernel-side instruction scheduling for Huawei Ascend NPUs.
  • Employs a novel 'rewinding' technique to de-optimize kernels and mine transferable optimization patterns, building a retrievable experience bank.
  • Achieved a 1.19x geometric-mean speedup on 127 operators, with nearly 50% beating manual references, automating a scarce expertise bottleneck.

Why It Matters

Automates performance optimization for a critical AI hardware platform (Huawei Ascend), reducing reliance on scarce expert knowledge and accelerating development.