Research & Papers

GTaP: A GPU-Resident Fork-Join Task-Parallel Runtime with a Pragma-Based Interface

A new pragma-based runtime brings fork-join task parallelism to GPUs, beating OpenMP on a 72-core CPU for irregular workloads and gaining up to 1.8x from divergence-aware task queueing.

Deep Dive

Researchers Yuki Maeda and Kenjiro Taura have introduced GTaP, a runtime system designed to bring efficient fork-join task parallelism, a staple of CPU programming, to GPU architectures. GPUs traditionally excel at regular, data-parallel workloads but struggle with irregular applications such as graph algorithms or recursive computations. GTaP addresses this with a persistent kernel model in which tasks are executed as state machines and joins are represented as continuations, allowing the GPU to manage complex, nested task graphs natively. The system supports two worker granularities (entire thread blocks or individual threads) and employs work-stealing for dynamic load balancing, which scales better than simple global-queue approaches.
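The tasks-as-state-machines idea can be illustrated with a small CPU-side sketch. This is not GTaP's actual code; the names (`Task`, `Worker`, `step`) and the single-deque scheduler are hypothetical stand-ins. Each task records where to resume, and a join is a continuation: the last child to finish re-enqueues its parent, the way a persistent GPU kernel would keep draining work without new kernel launches.

```cpp
#include <deque>

// Sketch only: a fork-join Fibonacci task expressed as a resumable state
// machine, with the join represented as a continuation on the parent task.
struct Task {
    enum State { Start, Join } state = Start; // where to resume next
    long n = 0;                // input
    long result = 0;           // filled in on completion
    Task* parent = nullptr;    // continuation target for the join
    int pending = 0;           // children still outstanding
    long child_sum = 0;        // accumulated child results
};

struct Worker {
    std::deque<Task*> queue;   // stands in for a per-worker task queue

    void run() {               // persistent worker loop
        while (!queue.empty()) {
            Task* t = queue.front();
            queue.pop_front();
            step(t);
        }
    }

    void step(Task* t) {
        switch (t->state) {
        case Task::Start:
            if (t->n < 2) {            // base case: complete immediately
                t->result = t->n;
                complete(t);
            } else {                   // fork two children; park this task
                t->state = Task::Join; // resume here once both finish
                t->pending = 2;
                for (long k : {t->n - 1, t->n - 2}) {
                    Task* c = new Task{};
                    c->n = k;
                    c->parent = t;
                    queue.push_back(c);
                }
            }
            break;
        case Task::Join:               // continuation: children are done
            t->result = t->child_sum;
            complete(t);
            break;
        }
    }

    void complete(Task* t) {
        if (Task* p = t->parent) {
            p->child_sum += t->result;
            if (--p->pending == 0)     // last child re-enqueues the parent
                queue.push_back(p);
        }
        // (tasks are leaked for brevity; a real runtime would recycle them)
    }
};
```

Because a parked task occupies no worker while it waits, a worker (a thread block or a single thread, in GTaP's two granularities) is never blocked inside a join; it simply moves on to the next runnable task.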

A key innovation is the Execution-Path-Aware Queueing (EPAQ) mechanism for thread-level workers. EPAQ lets programmers partition task queues based on user-defined criteria, which helps group tasks with similar control flows together. This reduces performance-sapping 'warp divergence,' a common GPU inefficiency where threads in a warp (a group of 32 threads) execute different instructions. The performance impact is significant: in benchmarks, GTaP outperformed OpenMP task-parallel execution on a high-end 72-core CPU, especially for large problem sizes with compute-heavy tasks. On the classic Fibonacci benchmark, the EPAQ optimization delivered up to a 1.8x speedup. The researchers also provide a pragma-based interface via an extended Clang compiler frontend, allowing developers to express fork-join parallelism without grappling with low-level GPU mechanics.
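The partitioning idea behind EPAQ can be sketched in a few lines. This is an illustrative CPU-side model, not GTaP's API: tasks land in sub-queues keyed by a user-defined criterion, and a batch drawn for a warp comes from a single sub-queue, so all lanes are expected to take the same control path. The key function and class names here are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <vector>

struct Task { long n; };

// Sketch of execution-path-aware queueing: one sub-queue per path key,
// so a warp-sized batch contains tasks with similar control flow.
class PartitionedQueue {
    std::map<int, std::queue<Task>> parts;  // sub-queue per path key
    std::function<int(const Task&)> key;    // user-defined criterion
public:
    explicit PartitionedQueue(std::function<int(const Task&)> k)
        : key(std::move(k)) {}

    void push(Task t) { parts[key(t)].push(std::move(t)); }

    // Pop up to `width` tasks from a single sub-queue, mimicking a warp
    // that wants all 32 lanes on the same execution path.
    std::vector<Task> pop_batch(std::size_t width) {
        std::vector<Task> batch;
        for (auto it = parts.begin(); it != parts.end(); ++it) {
            auto& q = it->second;
            while (!q.empty() && batch.size() < width) {
                batch.push_back(q.front());
                q.pop();
            }
            if (!batch.empty()) {
                if (q.empty()) parts.erase(it); // drop drained sub-queue
                return batch;
            }
        }
        return batch; // empty: no work left
    }
};
```

For Fibonacci-like tasks, a plausible key is "base case vs. recursive case" (`t.n < 2 ? 0 : 1`): each popped batch then contains only tasks taking one branch, which is exactly the divergence-avoiding grouping the article describes.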

Key Points
  • GTaP implements a persistent kernel model, executing tasks as state machines with joins as continuations to enable native GPU fork-join.
  • Its Execution-Path-Aware Queueing (EPAQ) reduces warp divergence, achieving up to a 1.8x speedup on the Fibonacci benchmark.
  • The system outperforms OpenMP on a 72-core CPU for large, compute-intensive irregular workloads and includes a pragma-based Clang interface for easier programming.

Why It Matters

Unlocks GPU acceleration for complex, irregular algorithms in fields like graph analytics, AI agent workflows, and scientific simulations, moving beyond simple data-parallel tasks.