Research & Papers

VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination

New compiler uses 'virtual tensors' to cut AI model memory use by up to 60%, accelerating inference.

Deep Dive

A research team with authors from institutions including the University of Illinois Urbana-Champaign and Microsoft Research has introduced VTC, a Deep Neural Network (DNN) compilation framework accepted to OSDI '26. VTC tackles a critical bottleneck in modern AI workloads: data movement between compute and memory. Its core innovation is the 'virtual tensor,' which tracks data movement between operators through lightweight index mappings instead of expensive physical transfers to and from global memory. This lets VTC target the full spectrum of data movement operators, a significant leap beyond current optimizations like layout transformation and operator fusion, which cover only a subset of these operators.
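To make the idea concrete, here is a minimal Python sketch of what a virtual tensor might look like; the class and method names are hypothetical illustrations, not VTC's actual API. Movement operators such as slicing and transposition return new logical views that compose index mappings, and no bytes move until a consumer reads an element through the mapping.

    import numpy as np

    class VirtualTensor:
        """A logical tensor backed by another buffer via an index mapping (illustrative)."""
        def __init__(self, base, shape, index_map):
            self.base = base            # physical buffer; never copied
            self.shape = shape          # logical shape seen by consumers
            self.index_map = index_map  # logical (i, j) -> physical (i, j)

        def transpose(self):
            # Movement op expressed as a mapping rewrite: zero bytes moved.
            inner = self.index_map
            return VirtualTensor(self.base, self.shape[::-1],
                                 lambda i, j: inner(j, i))

        def slice_rows(self, start):
            # Row slice expressed as an index offset: again, zero bytes moved.
            inner = self.index_map
            rows, cols = self.shape
            return VirtualTensor(self.base, (rows - start, cols),
                                 lambda i, j: inner(i + start, j))

        def read(self, i, j):
            # Consumers resolve elements through the composed mapping.
            return self.base[self.index_map(i, j)]

    buf = np.arange(12.0).reshape(3, 4)
    vt = VirtualTensor(buf, buf.shape, lambda i, j: (i, j))
    view = vt.slice_rows(1).transpose()  # two movement ops, no copies made
    print(view.read(0, 0))               # reads buf[1, 0] directly -> 4.0

Because the mappings compose lazily, an arbitrary chain of movement operators collapses into a single address computation at each read, which is what allows the approach to generalize beyond the fixed patterns handled by layout transformation or fusion.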

VTC's novel data movement elimination algorithm automatically identifies the most profitable strategy for creating these virtual tensors. Crucially, the framework is designed to interoperate seamlessly with existing, highly optimized computation kernels and can handle arbitrary compositions of tensor operators. In evaluations across a variety of DNNs, VTC demonstrated substantial performance gains, outperforming existing ML compilers by up to 1.93x, with an average speedup of 1.28x on NVIDIA GPUs. It also cut inference memory usage by up to 60%, with an average reduction of 17.5%, making it particularly relevant for efficiently deploying large, memory-intensive models such as contemporary LLMs.
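The interoperability claim can likewise be sketched: if an existing kernel is parameterized by an index mapping rather than a fixed buffer layout, it can consume a virtual tensor directly, so the intermediate (here, a transpose) is never materialized. This is an illustrative toy under that assumption, not the paper's implementation or kernel interface.

    import numpy as np

    def matvec_through_view(base, index_map, shape, x):
        """A toy 'existing kernel' parameterized by an index mapping, not a layout."""
        rows, cols = shape
        out = np.zeros(rows)
        for i in range(rows):
            for j in range(cols):
                # Each load resolves through the mapping; the transposed
                # intermediate is never written to memory.
                out[i] += base[index_map(i, j)] * x[j]
        return out

    A = np.arange(6.0).reshape(2, 3)
    x = np.ones(2)
    # Consume A^T virtually: logical shape (3, 2), mapping (i, j) -> (j, i).
    y = matvec_through_view(A, lambda i, j: (j, i), (3, 2), x)
    print(y)  # [3. 5. 7.], identical to A.T @ x with no transpose materialized

In a real compiler the profitability analysis would weigh the per-load remapping overhead against the saved round trip through global memory, which is presumably what VTC's elimination algorithm decides automatically.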

Key Points
  • Introduces 'virtual tensors' that use index mappings to eliminate physical data transfers between operators.
  • Outperforms existing ML compilers by up to 1.93x (1.28x avg) on NVIDIA GPU benchmarks.
  • Reduces inference memory usage by up to 60% (17.5% avg), crucial for large language models.

Why It Matters

Directly accelerates AI inference and reduces hardware costs, enabling more powerful models to run on existing infrastructure.