Tenstorrent Wormhole stencil kernel is competitive with CPU, uses less energy on large inputs
New research shows AI accelerator competitive for traditional HPC kernels.
A new paper from Lorenzo Piarulli and Daniele De Sensi explores whether AI-focused accelerators can efficiently handle traditional scientific workloads. They mapped 2D 5-point stencil computations, a core HPC kernel, onto Tenstorrent's Wormhole, a RISC-V dataflow accelerator designed for AI. They developed two heterogeneous implementations: Axpy, which breaks the stencil into element-wise submatrix operations, and MatMul, which reformulates it as a matrix multiplication. End-to-end, the CPU baseline was 3x faster. Detailed profiling, however, showed that the isolated Wormhole kernel was competitive with CPU execution, and that the gap came from PCIe transfers, device initialization, and host-side preprocessing.
Despite the slower runtime, the Axpy implementation achieved lower energy consumption than the CPU baseline for large inputs. The study identifies key architectural and software limitations, including memory bandwidth and host-device communication, and outlines concrete hardware and software improvements that could make AI accelerators like Wormhole competitive for HPC workloads. This work suggests that with better integration, AI accelerators could complement CPUs for energy-efficient scientific computing, not just AI training and inference.
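To make the two formulations concrete, here is a minimal plain-Python sketch of a 5-point stencil written three ways: directly, as a sum of scaled, shifted submatrices (the Axpy idea), and as matrix multiplications. This summary does not describe how the paper actually maps either version onto Wormhole, so everything here is an assumption for illustration: the function names, the weights `wc` (center) and `wn` (neighbors), the fixed boundary, and in particular the tridiagonal-matrix form `A·U + U·B`, which is just one known way to express a stencil as matrix multiplication and may differ from the authors' reformulation.

```python
# Illustrative sketch only: names, weights, and the tridiagonal matmul form
# are assumptions, not the paper's implementation.

def stencil_direct(u, wc, wn):
    """Reference: apply the 5-point stencil to every interior point."""
    n, m = len(u), len(u[0])
    return [[wc * u[i][j]
             + wn * (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1])
             for j in range(1, m - 1)]
            for i in range(1, n - 1)]

def shifted(u, di, dj):
    """Interior-sized submatrix of u, shifted by (di, dj)."""
    n, m = len(u), len(u[0])
    return [[u[i + di][j + dj] for j in range(1, m - 1)]
            for i in range(1, n - 1)]

def axpy(a, x, y):
    """Element-wise a*x + y on two equally sized matrices."""
    return [[a * xv + yv for xv, yv in zip(xr, yr)]
            for xr, yr in zip(x, y)]

def stencil_axpy(u, wc, wn):
    """Stencil as five axpy steps over shifted submatrices."""
    n, m = len(u), len(u[0])
    acc = [[0.0] * (m - 2) for _ in range(n - 2)]
    acc = axpy(wc, shifted(u, 0, 0), acc)          # center term
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        acc = axpy(wn, shifted(u, di, dj), acc)    # four neighbor terms
    return acc

def matmul(a, b):
    """Naive dense matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def tridiag(n, diag, off):
    """n x n matrix with `diag` on the diagonal and `off` next to it."""
    return [[diag if i == j else (off if abs(i - j) == 1 else 0.0)
             for j in range(n)]
            for i in range(n)]

def stencil_matmul(u, wc, wn):
    """Stencil as A @ U + U @ B, then cropped to the interior.

    A mixes vertical neighbors, B mixes horizontal neighbors; each carries
    half of the center weight so the two products sum to the full stencil.
    """
    n, m = len(u), len(u[0])
    left = matmul(tridiag(n, wc / 2, wn), u)
    right = matmul(u, tridiag(m, wc / 2, wn))
    full = [[l + r for l, r in zip(lr, rr)] for lr, rr in zip(left, right)]
    return [row[1:-1] for row in full[1:-1]]

if __name__ == "__main__":
    grid = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    print(stencil_direct(grid, 0.5, 0.125))
    print(stencil_axpy(grid, 0.5, 0.125))
    print(stencil_matmul(grid, 0.5, 0.125))
```

All three functions return the same interior values. The Axpy form uses only element-wise work, while the matmul form spends extra multiplications on zeros in exchange for an operation that matrix-oriented hardware is built to run.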
- CPU baseline was 3x faster end-to-end, but the isolated Wormhole kernel was competitive with CPU execution
- Axpy implementation consumed less energy than CPU for large input sizes
- Main bottlenecks identified: PCIe transfers, device initialization, and host-side preprocessing
Why It Matters
Suggests AI accelerators could supplement CPUs for energy-efficient scientific computing, expanding their role beyond AI.