Research & Papers

Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

Uses just 16 fixed AIE-ML cores to achieve a 211x higher platform-aware utility than spatial designs on AMD Versal chips

Deep Dive

Tempus, a new framework from researchers M. Grailoo and J. Núñez-Yáñez, tackles the fundamental challenge of running large language models (LLMs) on edge devices—where compute, memory, and power are severely constrained. Since matrix multiplication (GEMM) accounts for up to 90% of LLM inference time, efficient GEMM acceleration is critical. Existing state-of-the-art frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores. However, this approach fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption.

Tempus flips the paradigm: instead of adding more hardware as matrix size grows, it uses a fixed compute block of just 16 AIE-ML cores on the AMD Versal Adaptive SoC. Scalability is achieved through temporal scaling—iterative graph execution combined with algorithmic data tiling and replication in the Programmable Logic. High-speed cascade streaming ensures low-latency partial sum reduction at an Initiation Interval of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse.
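The core idea can be sketched in a few lines. This is a minimal, hypothetical illustration of temporal scaling, not code from the Tempus framework: a fixed pool of 16 tile-multiply "cores" is reused across iterations, so a larger matrix means more time steps rather than more hardware. The function and parameter names are mine.

```python
def temporal_gemm(A, B, num_cores=16, tile=2):
    """Sketch of temporal scaling: reuse a fixed pool of `num_cores`
    tile-multiply units across batches instead of instantiating more
    hardware for bigger matrices. Partial sums accumulate into C,
    loosely mimicking cascade-stream partial-sum reduction."""
    M, K, N = len(A), len(A[0]), len(B[0])
    assert M % tile == 0 and N % tile == 0 and K % tile == 0
    C = [[0] * N for _ in range(M)]
    # Every (i, j, k) tile triple is one unit of work for one "core".
    work = [(i, j, k) for i in range(0, M, tile)
                      for j in range(0, N, tile)
                      for k in range(0, K, tile)]
    # Temporal scaling: a bigger matrix yields more batches (more time
    # steps), never more than `num_cores` concurrent tile multiplies.
    for b in range(0, len(work), num_cores):
        for i, j, k in work[b:b + num_cores]:  # one batch = one graph iteration
            for r in range(i, i + tile):
                for c in range(j, j + tile):
                    C[r][c] += sum(A[r][x] * B[x][c]
                                   for x in range(k, k + tile))
    return C
```

In the real design, the batching loop corresponds to iterative AIE graph execution, the tiling and operand replication happen in the Programmable Logic, and partial-sum reduction rides the cascade streams between cores.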

On benchmark GEMM workloads, Tempus delivers 607 GOPS at 10.677 W total on-chip power. By the framework's Platform-Aware Utility (PAU) metric, it achieves a 211.2x higher prominence factor than the leading spatial state-of-the-art (ARIES). Just as notably, it consumes zero URAM and DSP resources while achieving 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand. These results position Tempus as a sustainable, scalable foundation for edge LLM inference on AMD's adaptive SoCs.
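For context, the raw energy efficiency implied by those figures works out as follows. The GOPS/W value is my arithmetic from the quoted numbers, not a metric reported in the paper:

```python
# Throughput and power as quoted for Tempus; the GOPS/W figure below is
# derived arithmetic, not a number taken from the paper.
gops = 607.0      # measured throughput
power_w = 10.677  # total on-chip power
efficiency = gops / power_w
print(f"{efficiency:.1f} GOPS/W")  # prints "56.9 GOPS/W"
```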

Key Points
  • Tempus uses only 16 fixed AIE-ML cores, achieving scalability through temporal (time-based) streaming rather than adding more hardware.
  • Delivers 607 GOPS at 10.677W total power, with a 211.2x higher prominence factor than the leading spatial framework (ARIES).
  • Zero URAM/DSP utilization, 22x core frugality, 7.1x power frugality, and 6.3x I/O reduction compared to spatial scaling approaches.

Why It Matters

Enables practical LLM inference on power-constrained edge devices without sacrificing performance or efficiency.