Image & Video

DeDelayed: Deleting Remote Inference Delay via On-Device Correction

New method improves real-time video segmentation by 9.8 mIoU despite 100ms cloud delay.

Deep Dive

A research team led by Dan Jacobellis has introduced DeDelayed, a novel system designed to eliminate the latency bottleneck in real-time video AI applications. The core challenge is that powerful video understanding models require cloud GPUs, but the round trip of sending frames, processing them, and receiving results back adds too much delay for latency-critical tasks like autonomous driving or robotic control. DeDelayed's solution is a hybrid architecture: a remote model in the cloud processes slightly delayed video frames and is specifically trained to make predictions about anticipated *future* frames. These predictions are then sent back to the device.

Simultaneously, a lightweight local model on the device has access to the *current* live video feed. This local model fuses the cloud's 'future' predictions with its own analysis to correct its output for the present moment. The two models are jointly optimized with a compression autoencoder to minimize the data that needs to be transmitted. In tests on the BDD100K driving dataset for real-time segmentation—a key task for understanding a vehicle's surroundings—DeDelayed delivered dramatic results. With a simulated 100ms network delay, it outperformed a purely local inference system by 6.4 mIoU (mean Intersection over Union, a segmentation accuracy metric) and beat a standard remote inference setup by 9.8 mIoU.
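For readers unfamiliar with the metric, mIoU averages the per-class overlap between predicted and ground-truth segmentation masks. A minimal NumPy sketch (the function name and toy masks are illustrative, not from the paper's released library):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes present in either mask."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both masks; skip it
        inter = np.logical_and(p, t).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x4 label maps with three classes (road=0, car=1, sky=2).
pred   = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 2]])
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 2]])
print(mean_iou(pred, target, num_classes=3))  # ≈ 0.833
```

A gain of 9.8 mIoU means this average improves by nearly ten percentage points across the dataset's classes.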

This performance gain is equivalent to what would be achieved by using a model ten times larger, but without the associated computational or power costs. The system effectively makes the cloud's latency invisible to the end application, enabling high-accuracy, real-time perception on resource-constrained platforms like drones, AR glasses, and mobile robots. The team has released the training code, pretrained models, and a Python library, making this research immediately accessible for further development and integration.
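The control flow described above can be sketched schematically. Everything here is an assumption for illustration: `remote_predict` and `local_correct` stand in for the paper's trained models, and the fixed frame delay is a simplification of real network timing.

```python
import collections

DELAY_FRAMES = 3  # e.g. ~100 ms at 30 fps; illustrative value

def hybrid_segment(frames, remote_predict, local_correct):
    """Schematic of the delayed-remote / live-local inference loop.

    remote_predict(frame) -> the remote model's prediction for the frame
        DELAY_FRAMES ahead; it arrives back after the round trip.
    local_correct(frame, hint) -> the local model's output for the live
        frame, corrected using the latest remote hint (None before any
        hint has arrived).
    """
    in_flight = collections.deque()  # (arrival_time, prediction) pairs
    hint = None
    outputs = []
    for t, frame in enumerate(frames):
        # Remote results sent DELAY_FRAMES ago arrive now.
        while in_flight and in_flight[0][0] <= t:
            _, hint = in_flight.popleft()
        # Dispatch the current frame; its future-frame prediction will
        # land exactly when that future frame is live on the device.
        in_flight.append((t + DELAY_FRAMES, remote_predict(frame)))
        outputs.append(local_correct(frame, hint))
    return outputs
```

With toy callables (e.g. `remote_predict = lambda f: f + DELAY_FRAMES`), the hint arriving at time `t` targets exactly frame `t`, which is the trick that makes the round-trip delay invisible to the consumer of `outputs`.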

Key Points
  • Hybrid cloud/device architecture improves video segmentation by 9.8 mIoU despite 100ms delay, matching a 10x larger model's gain.
  • Remote model predicts future frames; local on-device model corrects predictions for the current live frame.
  • Joint optimization with a compression autoencoder minimizes required downlink bandwidth for real-world use.

Why It Matters

Enables high-accuracy, real-time AI vision for drones, wearables, and robots without requiring massive local computing power.