Huawei's TurboGR scales generative recommendation training near-linearly (0.97) on Ascend NPUs
New system reaches 54.71% model FLOPs utilization (MFU) when training generative recommendation models on Ascend NPUs.
Researchers from Huawei, led by Huichao Chai, have published TurboGR on arXiv, a purpose-built training system for generative recommendation (GR) models. GR models replace fragmented recommendation architectures with a unified Transformer-based design. The team identified core bottlenecks in deploying GR on Ascend NPUs, particularly the lack of jagged operator support and the architectural mismatch between sparse primitives and the NPUs' dense-computation design.
TurboGR addresses these challenges through three innovations: (1) Ascend-optimized jagged acceleration with fusion operators that eliminate padding redundancy and dynamic load balancing that reduces inter-device imbalance from 47% to 2.4%, (2) distributed communication optimizations including hierarchical sparse parallelism and semi-asynchronous training with 94% NPU utilization, and (3) negative sampling optimizations via FP16 quantization and logit sharing. On the KuaiRand-27K dataset, TurboGR supports training models up to 0.2B parameters while achieving 54.71% MFU with near-linear scalability (0.97).
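To make the padding and load-balancing ideas concrete, here is a minimal Python sketch. It is not TurboGR's implementation: the function names, the greedy longest-first assignment, and the imbalance formula are all assumptions chosen for illustration, and the sequence lengths are made up.

```python
from typing import List, Sequence, Tuple


def to_jagged(seqs: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Pack variable-length sequences into flat values plus row offsets (no padding)."""
    values = [x for s in seqs for x in s]
    offsets = [0]
    for s in seqs:
        offsets.append(offsets[-1] + len(s))
    return values, offsets


def padding_waste(lengths: Sequence[int]) -> float:
    """Fraction of a max-length padded batch that would be padding."""
    padded_slots = max(lengths) * len(lengths)
    return 1.0 - sum(lengths) / padded_slots


def balance(lengths: Sequence[int], num_devices: int) -> Tuple[List[List[int]], List[int]]:
    """Greedy longest-first assignment of sequences to devices (illustrative only)."""
    loads = [0] * num_devices
    assignment: List[List[int]] = [[] for _ in range(num_devices)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        device = loads.index(min(loads))  # least-loaded device so far
        assignment[device].append(idx)
        loads[device] += lengths[idx]
    return assignment, loads


def imbalance(loads: Sequence[int]) -> float:
    """Assumed metric: relative gap between the heaviest and lightest device."""
    return (max(loads) - min(loads)) / max(loads)


# Hypothetical token counts per user sequence.
lengths = [9, 8, 7, 6, 4, 3, 2, 1]
naive_loads = [sum(lengths[:4]), sum(lengths[4:])]  # contiguous split across 2 devices
_, balanced_loads = balance(lengths, num_devices=2)
```

With these made-up lengths, a max-length padded batch would be about 44% padding, the contiguous split puts 30 tokens on one device and 10 on the other, and the greedy assignment puts 20 on each.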
- TurboGR is an Ascend NPU-optimized training system for generative recommendation models up to 0.2B parameters
- Achieves 54.71% MFU with near-linear scalability (0.97) and cuts inter-device imbalance from 47% to 2.4%
- Introduces jagged operator fusion, semi-asynchronous training, and FP16 quantization for efficient large-scale training
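For readers unfamiliar with the two headline metrics, the sketch below shows how they are commonly computed. MFU is the ratio of achieved to peak hardware FLOPs per second. The scaling-efficiency formula is an assumption about what the 0.97 figure measures, and every input number is a hypothetical placeholder rather than a value from the paper or an Ascend specification.

```python
def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: useful model FLOPs/s divided by hardware peak FLOPs/s."""
    return achieved_flops_per_s / peak_flops_per_s


def scaling_efficiency(throughput_n: float, throughput_1: float, n: int) -> float:
    """Assumed definition: throughput on n devices relative to n times single-device throughput."""
    return throughput_n / (n * throughput_1)


# Hypothetical placeholders chosen only to reproduce the reported ratios.
example_mfu = mfu(achieved_flops_per_s=54.71e12, peak_flops_per_s=100e12)
example_scaling = scaling_efficiency(throughput_n=7.76, throughput_1=1.0, n=8)
```

Under these placeholder inputs, `example_mfu` is about 0.5471 and `example_scaling` is about 0.97.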
Why It Matters
TurboGR enables cost-effective, high-performance training of generative recommendation systems on Ascend NPUs for production-scale AI applications.