Research & Papers

EdgeServing: Deadline-Aware Multi-DNN Serving at the Edge

Time-division GPU sharing plus early-exit inference reduces P95 latency by 35%

Deep Dive

Edge computing increasingly runs multiple DNN models on a single GPU, but existing schedulers rely on local heuristics that ignore how each decision affects the tail latency of all concurrent queues, and GPU spatial-sharing approaches sacrifice latency predictability. To address this, researchers from multiple institutions have developed EdgeServing, a deadline-aware multi-DNN serving system. EdgeServing combines time-division GPU sharing with early-exit inference, terminating inference early for less demanding inputs while keeping latency predictable. It introduces a stability score that quantifies the global effect of each candidate scheduling decision on future queue states, letting the scheduler jointly select the model, exit point, and batch size at runtime to minimize the predicted system-wide SLO impact.
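The selection step is easy to picture in code. The sketch below is a minimal, hypothetical rendering of stability-score scheduling, assuming a toy latency model and an invented scoring formula; `Candidate`, `predicted_latency`, and `stability_score` are illustrative names, not EdgeServing's actual API.

```python
# Hypothetical sketch of stability-score scheduling. The cost model, the
# scoring formula, and all names are illustrative assumptions, not the
# paper's actual code or API.
from dataclasses import dataclass
from itertools import product

@dataclass
class Candidate:
    model: str        # which model's queue to serve next
    exit_point: int   # early-exit branch to run to (1 = shallowest)
    batch_size: int   # how many queued requests to batch together

def predicted_latency(c: Candidate) -> float:
    """Toy cost model: deeper exits and larger batches take longer (ms)."""
    return 2.0 * c.exit_point + 0.5 * c.batch_size

def stability_score(c: Candidate, queues: dict[str, list[float]]) -> float:
    """Assumed scoring: total predicted deadline slack across *all* queues
    after running c, so a pick that starves another queue scores poorly."""
    run_time = predicted_latency(c)
    slack = 0.0
    for model, deadlines in queues.items():
        for i, deadline in enumerate(deadlines):
            if model == c.model and i < c.batch_size:
                slack += deadline - run_time        # served in this batch
            else:
                slack += deadline - run_time - 2.0  # assumed extra wait
    return slack

def schedule(queues: dict[str, list[float]]) -> Candidate:
    """Jointly pick the (model, exit point, batch size) with the best
    predicted global score."""
    candidates = [
        Candidate(m, e, b)
        for m, e, b in product(queues, range(1, 4), (1, 2, 4))
        if b <= len(queues[m])
    ]
    return max(candidates, key=lambda c: stability_score(c, queues))

# Example: two model queues, each request tagged with its deadline slack (ms).
queues = {"detector": [30.0, 45.0], "classifier": [20.0]}
print(schedule(queues))
```

The point the sketch mirrors is that the score sums predicted slack over every queue, so a decision that helps one queue while starving another scores poorly.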

Experimental results across several hardware platforms show that EdgeServing consistently outperforms representative baselines in both SLO violation ratio and P95 latency. The early-exit mechanism expands the scheduling action space under tight latency constraints, giving the system more flexibility to meet deadlines. EdgeServing is particularly relevant for real-time applications such as autonomous navigation, smart surveillance, and industrial control, where multiple AI models must run concurrently on resource-constrained edge devices. The paper has been accepted at IEEE ICCCN 2026, marking a step forward in practical, predictable edge AI serving.
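To make the early-exit idea concrete, here is a toy sketch: a chain of stages, each with a confidence head, where inference stops at the first stage that is confident enough. The stages, confidence values, and threshold are invented; in EdgeServing the exit point is chosen by the scheduler rather than by a fixed per-input threshold, so treat this only as an illustration of why early exits shrink latency for easy inputs.

```python
# Toy early-exit chain: each stage has a confidence head, and inference
# stops at the first stage that is confident enough. Stages, confidence
# values, and the threshold are all invented for illustration.
import random

def run_stage(stage: int, x: list[float]) -> tuple[list[float], float]:
    """Stand-in for one DNN block plus its attached exit head."""
    features = [v * 1.1 for v in x]                    # fake computation
    confidence = min(1.0, 0.4 + 0.2 * stage + 0.1 * random.random())
    return features, confidence

def early_exit_infer(x: list[float], n_stages: int = 4,
                     threshold: float = 0.9) -> tuple[int, float]:
    """Run stages in order; easy inputs exit early and cost fewer stages."""
    for stage in range(1, n_stages + 1):
        x, conf = run_stage(stage, x)
        if conf >= threshold:       # confident enough: stop here
            return stage, conf
    return n_stages, conf           # hard input: ran the full network

print(early_exit_infer([0.1, 0.2, 0.3]))  # e.g. (3, 0.93)
```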

Key Points
  • EdgeServing uses time-division GPU sharing with early-exit inference for predictable latency, unlike spatial-sharing approaches (see the time-slicing sketch after this list).
  • A stability score quantifies the global impact of each scheduling decision on future queues, improving system-wide SLO adherence.
  • Achieves lower SLO violation ratio and P95 latency on multiple hardware platforms compared to representative baselines.
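
As referenced in the first point, a minimal simulation shows what time-division sharing buys: the GPU runs exactly one model's batch per time slice, so completion times stack deterministically. The queue contents and slice lengths below are assumptions for the example, not measurements from the paper.

```python
# Minimal simulation of time-division GPU sharing: the GPU runs exactly one
# model's batch per slice, so completion times stack deterministically.
# Queue contents and slice lengths are assumptions for the example.
import collections
import itertools

queues = {m: collections.deque() for m in ("detector", "classifier")}
for i in range(3):
    queues["detector"].append(f"det-req-{i}")
    queues["classifier"].append(f"cls-req-{i}")

SLICE_MS = {"detector": 12.0, "classifier": 8.0}  # assumed per-model slice
clock_ms = 0.0

# Round-robin slices: no two models ever share the GPU concurrently, which
# is what makes each batch's completion time easy to predict.
for model in itertools.cycle(queues):
    if not any(queues.values()):
        break
    if queues[model]:
        req = queues[model].popleft()
        clock_ms += SLICE_MS[model]
        print(f"{clock_ms:6.1f} ms  {model}: finished {req}")
```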

Why It Matters

Enables reliable, real-time multi-DNN inference on edge devices for latency-sensitive applications like autonomous driving and industrial IoT.