A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs
A new methodology lets developers combine CUDA, SYCL, and Triton APIs in a single application without a performance penalty.
A team of researchers from the Barcelona Supercomputing Center has published a paper outlining a novel task-based data-flow methodology designed to tackle the growing complexity of programming modern heterogeneous computing systems. As high-performance computing (HPC) and AI infrastructures increasingly rely on nodes combining multi-core CPUs with diverse accelerators like GPUs and FPGAs, developers face the arduous task of orchestrating multiple low-level, incompatible accelerator APIs such as NVIDIA's CUDA, the Khronos Group's SYCL, and OpenAI's Triton. The proposed solution aims to eliminate the error-prone and labor-intensive process of manually combining these different programming models within a single application.
The core innovation involves expressing applications as a directed acyclic graph (DAG) of tasks managed by an OpenMP/OmpSs-2 runtime. The researchers introduce Task-Aware SYCL (TASYCL) and leverage existing Task-Aware CUDA (TACUDA) to wrap individual accelerator kernel calls, elevating them to first-class tasks within the system. A critical component is the nOS-V threading library, which unifies thread management to prevent performance-degrading oversubscription when multiple native runtimes (like CUDA and SYCL) coexist. The paper demonstrates that this combination of task-aware libraries and unified threading enables a single application to harness the 'best-in-class' kernels from multiple accelerator APIs transparently and efficiently, providing a scalable path forward for increasingly complex hardware stacks.
- Proposes Task-Aware APIs (TA-libs) like TASYCL to seamlessly integrate CUDA, SYCL, and Triton programming models.
- Uses the nOS-V library to unify thread management, solving oversubscription and performance variability in multi-API environments.
- Enables applications to be expressed as task DAGs managed by OpenMP/OmpSs-2, making the methodology immediately usable on current HPC/AI nodes.
Why It Matters
This reduces development complexity and unlocks performance by letting AI and HPC applications use the best available kernel for each task, regardless of whether it targets NVIDIA, Intel, or other hardware.