
Tempus framework accelerates LLM matrix multiplication on edge SoCs with 22x core frugality using temporal scaling

Uses just 16 fixed AIE-ML cores on AMD Versal chips and scores 211.2x higher than the leading spatial design on a platform-aware utility metric

Deep Dive

Tempus, a new framework from researchers M. Grailoo and J. Núñez-Yáñez, tackles the fundamental challenge of running large language models (LLMs) on edge devices, where compute, memory, and power are severely constrained. Because general matrix multiplication (GEMM) accounts for up to 90% of LLM inference time, efficient GEMM acceleration is critical. Existing state-of-the-art frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores. On resource-limited edge SoCs, however, that approach breaks down: designs fail physical implementation, bandwidth saturates, and resource consumption becomes excessive.

Tempus flips the paradigm: instead of adding more hardware as matrix size grows, it uses a fixed compute block of just 16 AIE-ML cores on the AMD Versal Adaptive SoC. Scalability comes from temporal scaling, meaning iterative graph execution combined with algorithmic data tiling and replication in the programmable logic (PL). High-speed cascade streaming provides low-latency partial-sum reduction at an initiation interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse.
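To make the idea of temporal scaling concrete, here is a minimal software sketch, not the Tempus implementation. It assumes a hypothetical fixed "compute block" that multiplies one small tile pair at a time, and it handles larger matrices only by running that same block for more iterations. The names `TILE`, `block_matmul_acc`, `tile`, and `temporal_gemm` are illustrative and do not come from the paper, and the sketch models neither AIE-ML cores, cascade streams, nor PLIO.

```python
# Illustrative model of temporal scaling: the compute block never grows,
# only the number of iterations does. All names here are hypothetical.

TILE = 4  # fixed block size, independent of the matrix dimensions


def block_matmul_acc(a_tile, b_tile, acc):
    """Multiply one fixed-size tile pair and accumulate partial sums into acc."""
    for i in range(len(a_tile)):
        for j in range(len(b_tile[0])):
            s = acc[i][j]
            for k in range(len(b_tile)):
                s += a_tile[i][k] * b_tile[k][j]
            acc[i][j] = s


def tile(m, r0, c0, size):
    """Extract a size x size tile starting at (r0, c0), zero-padding past the edge."""
    rows, cols = len(m), len(m[0])
    return [[m[r][c] if r < rows and c < cols else 0
             for c in range(c0, c0 + size)]
            for r in range(r0, r0 + size)]


def temporal_gemm(a, b, size=TILE):
    """Compute a @ b by reusing one fixed-size block over many iterations.

    Returns the product and the number of block iterations that were needed.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    iterations = 0
    for i0 in range(0, m, size):
        for j0 in range(0, n, size):
            acc = [[0] * size for _ in range(size)]
            # Walk the shared dimension, reducing partial sums into acc.
            for k0 in range(0, k, size):
                block_matmul_acc(tile(a, i0, k0, size),
                                 tile(b, k0, j0, size), acc)
                iterations += 1
            # Copy back only the valid (unpadded) part of the result tile.
            for i in range(min(size, m - i0)):
                for j in range(min(size, n - j0)):
                    out[i0 + i][j0 + j] = acc[i][j]
    return out, iterations
```

In this model, a larger matrix raises the iteration count while `TILE` stays constant, which is the trade the article describes: more time on the same hardware instead of more hardware.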

On benchmark GEMM workloads, Tempus delivers 607 GOPS at 10.677 W total on-chip power. On the framework's Platform-Aware Utility (PAU) metric, it shows a 211.2x higher prominence factor than the leading spatial state-of-the-art framework (ARIES). It also uses 0.00% of URAM and DSP resources, achieving 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand. These results position Tempus as a sustainable, scalable foundation for edge LLM inference on AMD's adaptive SoCs.
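For a rough sense of scale, the reported throughput, power, and core count can be turned into per-watt and per-core figures. The sketch below is simple arithmetic on the numbers quoted above. It is not the PAU metric, whose definition is not given here, and the function names are illustrative.

```python
# Back-of-the-envelope ratios from the figures quoted in the article.
# These are derived values, not results reported by the authors.

REPORTED_GOPS = 607.0      # throughput on benchmark GEMM workloads
REPORTED_POWER_W = 10.677  # total on-chip power
FIXED_CORES = 16           # AIE-ML cores in the fixed compute block


def gops_per_watt(gops, watts):
    """Energy efficiency as throughput divided by power."""
    return gops / watts


def gops_per_core(gops, cores):
    """Average throughput contributed by each core."""
    return gops / cores


if __name__ == "__main__":
    print(f"{gops_per_watt(REPORTED_GOPS, REPORTED_POWER_W):.2f} GOPS/W")
    print(f"{gops_per_core(REPORTED_GOPS, FIXED_CORES):.2f} GOPS/core")
```

Run as a script, this prints roughly 56.85 GOPS/W and 37.94 GOPS/core.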

Key Points
  • Tempus uses only 16 fixed AIE-ML cores, achieving scalability through temporal scaling (iterative, time-based execution) rather than adding more hardware.
  • Delivers 607 GOPS at 10.677 W total on-chip power, with a 211.2x higher PAU prominence factor than the leading spatial framework (ARIES).
  • Uses 0.00% of URAM and DSP resources, with 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand compared to spatial scaling approaches.

Why It Matters

Shows that a small, fixed compute block can serve growing GEMM workloads on power-constrained edge devices, which makes LLM inference on such hardware more practical.
