A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs
A new methodology lets developers combine CUDA, SYCL, and Triton APIs in a single application without a performance penalty.
A team of researchers from the Barcelona Supercomputing Center has published a paper outlining a novel task-based data-flow methodology designed to tackle the growing complexity of programming modern heterogeneous computing systems. As high-performance computing (HPC) and AI infrastructures increasingly rely on nodes combining multi-core CPUs with diverse accelerators like GPUs and FPGAs, developers face the arduous task of orchestrating multiple low-level, incompatible accelerator APIs such as NVIDIA's CUDA, the Khronos Group's SYCL, and OpenAI's Triton. The proposed solution aims to eliminate the error-prone and labor-intensive process of manually combining these different programming models within a single application.
The core innovation involves expressing applications as a directed acyclic graph (DAG) of tasks managed by an OpenMP/OmpSs-2 runtime. The researchers introduce Task-Aware SYCL (TASYCL) and leverage existing Task-Aware CUDA (TACUDA) to wrap individual accelerator kernel calls, elevating them to first-class tasks within the system. A critical component is the nOS-V threading library, which unifies thread management to prevent performance-degrading oversubscription when multiple native runtimes (like CUDA and SYCL) coexist. The paper demonstrates that this combination of task-aware libraries and unified threading enables a single application to harness the 'best-in-class' kernels from multiple accelerator APIs transparently and efficiently, providing a scalable path forward for increasingly complex hardware stacks.
- Proposes Task-Aware APIs (TA-libs) like TASYCL to seamlessly integrate CUDA, SYCL, and Triton programming models.
- Uses the nOS-V library to unify thread management, solving oversubscription and performance variability in multi-API environments.
- Enables applications to be expressed as task DAGs managed by OpenMP/OmpSs-2, making the methodology immediately usable on current HPC/AI nodes.
Why It Matters
This reduces development complexity and unlocks performance by letting AI and HPC applications use the best available kernel for each task, regardless of whether it targets NVIDIA, Intel, or other hardware.