Research & Papers

Collaborative Processing for Multi-Tenant Inference on Memory-Constrained Edge TPUs

New research cuts AI inference latency by up to 77.4% on memory-constrained Edge TPUs using dynamic workload partitioning.

Deep Dive

A research team from UMass Amherst and the Indian Institute of Science has developed SwapLess, a breakthrough system that dramatically improves AI inference performance on memory-constrained Edge TPUs. Published on arXiv, the system addresses a critical bottleneck in edge computing: the limited on-chip memory of AI accelerators that forces constant swapping of model segments between host and accelerator memory, significantly inflating latency.
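As a rough, back-of-the-envelope illustration of that bottleneck (not taken from the paper), per-request latency can be modeled as on-chip compute time plus the time spent re-streaming model segments that no longer fit in accelerator memory. All sizes, bandwidths, and timings below are hypothetical placeholders.

```python
# Toy model of how off-chip swapping inflates inference latency.
# Every number and field name here is a hypothetical placeholder,
# not a measurement or API from the SwapLess paper.

def request_latency_ms(segments, on_chip_mb, host_bw_mb_per_ms=0.5):
    """Estimate one inference's latency when model segments that exceed
    on-chip memory must be streamed in from the host before executing."""
    resident_mb = 0.0
    latency_ms = 0.0
    for seg in segments:
        if resident_mb + seg["size_mb"] > on_chip_mb:
            # Segment is not resident: pay the host-to-accelerator transfer.
            latency_ms += seg["size_mb"] / host_bw_mb_per_ms
            resident_mb = seg["size_mb"]   # older weights are evicted
        else:
            resident_mb += seg["size_mb"]
        latency_ms += seg["compute_ms"]
    return latency_ms

# A hypothetical 24 MB model on an accelerator with 8 MB of on-chip memory:
model = [{"size_mb": 6.0, "compute_ms": 0.8} for _ in range(4)]
print(f"{request_latency_ms(model, on_chip_mb=8.0):.1f} ms per request")
```

With these made-up numbers, transfers account for over 90% of the request time, which is the kind of latency inflation the paragraph above describes.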

The core innovation is SwapLess's adaptive approach to collaborative processing. Unlike static partitioning methods, SwapLess continuously adjusts both the partition point (where model processing splits between CPU and TPU) and CPU core allocation using an analytic queueing model. This model captures partition-dependent service times and swapping overheads across different workload mixes and request rates. The system makes these adjustments online with minimal decision overhead, optimizing for end-to-end response time rather than just memory usage.
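To make the idea concrete, here is a minimal sketch of what such an online controller could look like, assuming an M/M/1 queueing approximation and made-up per-layer timings; the paper's actual analytic model, parameters, and decision logic are not reproduced here.

```python
# Illustrative controller in the spirit of adaptive CPU-TPU partitioning.
# The queueing formula, timings, and swap costs are assumptions for this
# sketch, not the authors' model.
from itertools import product

def mm1_response_time(arrival_rate, service_time_s):
    """Mean response time of an M/M/1 queue; infinite if overloaded."""
    utilization = arrival_rate * service_time_s
    return float("inf") if utilization >= 1.0 else service_time_s / (1.0 - utilization)

def choose_plan(arrival_rate, cpu_ms, tpu_ms, swap_ms, total_cores=4):
    """Pick the partition point k (first k layers on CPU, rest on TPU) and a
    CPU core allocation that minimize estimated end-to-end response time."""
    best = None
    for k, cores in product(range(len(cpu_ms) + 1), range(1, total_cores + 1)):
        # CPU stage: first k layers, with an idealized linear speedup from cores.
        cpu_service = sum(cpu_ms[:k]) / cores / 1000.0
        # TPU stage: remaining layers plus the swap overhead for that split;
        # offloading more layers to the CPU shrinks the TPU working set.
        tpu_service = (sum(tpu_ms[k:]) + swap_ms[k]) / 1000.0
        estimate = (mm1_response_time(arrival_rate, cpu_service)
                    + mm1_response_time(arrival_rate, tpu_service))
        if best is None or estimate < best[0]:
            best = (estimate, k, cores)
    return best  # (estimated response time in seconds, partition point, cores)

# Hypothetical per-layer costs (ms) and swap overhead per candidate split.
cpu_ms  = [4.0, 4.0, 6.0, 6.0]           # cost of each layer on the CPU
tpu_ms  = [0.5, 0.5, 0.8, 0.8]           # cost of each layer on the TPU
swap_ms = [30.0, 18.0, 8.0, 0.0, 0.0]    # swap cost when layers k.. run on TPU
print(choose_plan(arrival_rate=20.0, cpu_ms=cpu_ms,
                  tpu_ms=tpu_ms, swap_ms=swap_ms))
```

In this toy setting the best plan runs several layers on the CPU even though they compute faster on the TPU, because doing so eliminates the swap overhead; a real controller would also need to model contention from other tenants sharing the CPU cores, which is where the core-allocation decision matters.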

Implementation results are striking: SwapLess reduces mean latency by up to 63.8% for single-tenant workloads and 77.4% for multi-tenant workloads compared to Google's default Edge TPU compiler. This performance gain comes from intelligently balancing computation between CPU and TPU resources to minimize both memory pressure and excessive CPU computation. The system is particularly effective in dynamic, multi-tenant environments where multiple AI models compete for limited accelerator resources.

This research matters because Edge TPUs are increasingly deployed in IoT devices for applications requiring low-latency AI inference, from autonomous vehicles to industrial monitoring. SwapLess enables these devices to run more complex models or serve more users without hardware upgrades, potentially accelerating the deployment of AI at the edge while maintaining performance guarantees.

Key Points
  • Reduces mean latency by up to 77.4% for multi-tenant Edge TPU workloads compared to Google's default Edge TPU compiler
  • Uses adaptive partitioning that continuously adjusts the CPU-TPU split and core allocation based on an analytic queueing model
  • Addresses memory swapping bottleneck that plagues edge AI accelerators with limited on-chip memory

Why It Matters

Enables faster, more efficient on-device AI for IoT applications in connectivity-limited or safety-critical scenarios without hardware upgrades.