Research & Papers

VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU

New model abstracts async hardware as virtual cores, cutting kernel programming effort by 90%.

Deep Dive

Modern GPUs pack increasingly specialized asynchronous hardware units—tensor cores, direct memory access engines, and more—to deliver high performance. Yet their potential remains underutilized because GPU software stacks still cling to a monolithic kernel model that treats each execution as a single, synchronous block. This mismatch forces programmers to manually orchestrate which unit does what, wasting effort and leaving throughput on the table.

VDCores, created by researchers from UC San Diego and Cornell, addresses this with a new abstraction: Virtual Decoupled Cores. It represents each asynchronous hardware unit as a resource-isolated virtual core and decomposes workloads into tiny, dependency-connected micro-operations (micro-ops). The runtime schedules these micro-ops automatically, based on data dependencies and resource availability, so memory transfers and compute overlap without programmer intervention. Tested on LLM inference (including decoding) across NVIDIA GH200, H100, and RTX 6000 Pro GPUs, VDCores delivered a 24% average throughput boost, rising to as much as 77% under dynamic inputs. It also cut kernel programming and specialization effort by 90%. The team has open-sourced VDCores, so researchers and engineers can try the decoupled model on current GPU architectures.
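To make the scheduling idea concrete, here is a minimal CPU-side sketch in Python of dependency-driven scheduling across isolated engines. It is not the VDCores API or runtime: the names `MicroOp` and `schedule`, the engine labels `"dma"` and `"tensor"`, and the durations are all hypothetical, chosen only to illustrate how micro-ops can be placed by data readiness and engine availability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical micro-op record (illustration only, not the VDCores API)."""
    name: str
    engine: str        # virtual core that must run this op, e.g. "dma" or "tensor"
    duration: int      # abstract time units
    deps: tuple = ()   # names of micro-ops whose outputs this one consumes


def schedule(ops):
    """Greedy list scheduling. Returns {name: (start, end)}.

    Each engine runs one micro-op at a time. An op may start once all of its
    dependencies have finished and its engine is idle.
    """
    engine_free = {op.engine: 0 for op in ops}  # time at which each engine is idle
    done = {}                                   # name -> (start, end)
    pending = list(ops)

    def earliest_start(op):
        data_ready = max((done[d][1] for d in op.deps), default=0)
        return max(data_ready, engine_free[op.engine])

    while pending:
        ready = [op for op in pending if all(d in done for d in op.deps)]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        op = min(ready, key=earliest_start)     # ties go to program order
        start = earliest_start(op)
        done[op.name] = (start, start + op.duration)
        engine_free[op.engine] = start + op.duration
        pending.remove(op)
    return done


# Toy workload: three tiles, each loaded by the copy engine and then multiplied.
ops = []
for i in range(3):
    ops.append(MicroOp(f"load{i}", "dma", 2))
    ops.append(MicroOp(f"mm{i}", "tensor", 3, deps=(f"load{i}",)))

timeline = schedule(ops)
makespan = max(end for _, end in timeline.values())
serial = sum(op.duration for op in ops)
print(f"makespan={makespan} serial={serial}")
```

In this toy workload the load of tile 1 runs while tile 0 is still being multiplied, so the schedule finishes in 11 time units instead of the 15 a fully serial execution would take. The overlap falls out of the dependency graph alone; nothing in the workload description says which operations should run concurrently.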

Key Points
  • VDCores abstracts asynchronous GPU hardware as virtual cores with dependency-connected micro-ops, automating memory-compute overlap.
  • LLM decoding throughput improved by 24% on average, and by up to 77% under dynamic inputs, across GH200, H100, and RTX 6000 Pro.
  • Kernel programming and specialization effort reduced by 90% due to automatic orchestration.

Why It Matters

VDCores unlocks latent GPU async compute power, significantly accelerating LLM inference with minimal developer effort.