Lifting to tensors when compiling scientific computing workloads for AI Engines
A novel compilation pipeline lets existing OpenMP codes run on AMD's NPU with no manual porting.
A new paper accepted at CCGrid 2026 presents a compilation pipeline that automatically maps general-purpose scientific computing loops to AMD's AI Engines (AIEs), specialized NPUs integrated into AMD CPUs. The key innovation is "lifting" loop semantics into a tensor representation: this abstraction captures the intent of loops annotated with OpenMP directives, letting the compiler schedule the computation efficiently onto the AIE's dataflow architecture without developers having to rewrite or port their code by hand. The approach substantially reduces the expertise and time needed to target this heterogeneous compute resource, so existing scientific codes can benefit from the NPU's energy and performance advantages.
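As a concrete illustration, consider the kind of loop nest such a pipeline could lift. This is a hypothetical sketch, not code from the paper; the kernel, the function name `matvec`, and the pragma placement are purely illustrative:

```c
#include <stddef.h>

/* Hypothetical OpenMP-annotated kernel (illustrative, not from the paper):
 * a dense matrix-vector product written as plain loops. A lifting pass can
 * recognize this pattern as the tensor operation y = A * x, which a backend
 * can then tile and schedule onto the AIE array's dataflow fabric instead
 * of executing the scalar loops directly. */
void matvec(size_t m, size_t n,
            const float A[m][n], const float x[n], float y[m]) {
    #pragma omp parallel for
    for (size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; ++j)
            acc += A[i][j] * x[j];
        y[i] = acc;
    }
}
```

The point of the tensor abstraction is that the compiler reasons about `y = A * x` as a whole rather than about individual iterations, which is what makes automatic tiling and dataflow scheduling tractable.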
In benchmarks across six kernels spanning AI and scientific computing, the NPU matched multicore CPU performance on float32 workloads while consuming less energy. More notably, for two of the scientific computing kernels, the team demonstrated that executing the workload across the CPU and NPU simultaneously (heterogeneous execution) delivered up to a 40% performance improvement and a 15% reduction in energy usage over the CPU alone. This highlights the potential of hybrid CPU+NPU execution to accelerate scientific workloads in edge and resource-constrained environments, where performance per watt is critical.
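A minimal sketch of what such a CPU+NPU split might look like at the source level is below. This is hypothetical, not the paper's runtime: standard OpenMP target offload stands in for the NPU path, `saxpy_hetero` is an invented name, and the 60/40 partition ratio is purely illustrative:

```c
#include <stddef.h>

/* Hypothetical heterogeneous split (a sketch, not the paper's mechanism):
 * part of the iteration space is offloaded asynchronously while the host
 * CPU processes the remainder, and the two halves are then joined. */
void saxpy_hetero(size_t n, float a, const float *x, float *y) {
    size_t split = n * 3 / 5;  /* illustrative 60/40 device:host ratio */

    /* Device portion: async OpenMP target offload as a stand-in for the
     * NPU code path; `nowait` makes it a deferred task so the host
     * thread continues immediately. */
    #pragma omp target teams distribute parallel for nowait \
        map(to: x[0:split]) map(tofrom: y[0:split])
    for (size_t i = 0; i < split; ++i)
        y[i] += a * x[i];

    /* Host portion runs concurrently on the CPU cores. */
    #pragma omp parallel for
    for (size_t i = split; i < n; ++i)
        y[i] += a * x[i];

    /* Join: wait for the deferred target task to complete. */
    #pragma omp taskwait
}
```

In practice the split ratio would need to be tuned per kernel to balance the relative throughput of the two devices; a static 60/40 division is only for illustration.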
- Compilation pipeline lifts OpenMP-annotated loops to tensor representation for AMD AI Engines.
- NPU matches CPU float32 performance across six benchmarks with lower energy consumption.
- Heterogeneous CPU+NPU execution delivers up to 40% speedup and 15% energy savings over CPU alone.
Why It Matters
Enables existing scientific codes to leverage AMD's NPU without manual rewriting, boosting performance and energy efficiency.