Research & Papers

PipeSD framework speeds up cloud-edge LLM inference by up to 2.16x

A new pipeline scheduling method cuts energy use by up to 25.3% while boosting token throughput.

Deep Dive

PipeSD targets key bottlenecks in speculative decoding for cloud-edge LLM inference. Existing methods run token generation and communication sequentially, leave resources underutilized, and trigger verification inflexibly, which causes either premature verification or costly rollbacks. PipeSD introduces a token-batch pipeline scheduling mechanism, optimized via dynamic programming, that overlaps token generation with communication. It also features a dual-threshold non-autoregressive verification (NAV) triggering mechanism, tuned by a lightweight Bayesian optimization autotuner, that enables flexible verification decisions.
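
To make the idea of overlapping generation with communication concrete, here is a minimal Python sketch of a dynamic program that splits a run of draft tokens into batches. It is an illustration under stated assumptions, not PipeSD's actual formulation: the cost model (a constant per-token generation time, plus a fixed and a per-token cost for each transfer) and every name in it (schedule_batches, gen_time, send_fixed, send_per_token) are hypothetical.

```python
from typing import List, Tuple


def schedule_batches(
    n_tokens: int,
    gen_time: float,        # assumed: seconds to draft one token on the edge device
    send_fixed: float,      # assumed: fixed latency paid by every transfer
    send_per_token: float,  # assumed: extra transfer time per token in a batch
) -> Tuple[float, List[int]]:
    """Split n_tokens into consecutive batches so the last transfer ends earliest.

    A batch can be sent only after its last token has been drafted and the
    previous transfer has finished; drafting continues while the link is busy.
    """
    inf = float("inf")
    # finish[i] = earliest time at which the first i tokens have all been sent
    finish = [inf] * (n_tokens + 1)
    prev = [0] * (n_tokens + 1)
    finish[0] = 0.0

    for i in range(1, n_tokens + 1):
        ready = i * gen_time  # time at which token i has been drafted
        for j in range(i):    # the last batch covers tokens j+1 .. i
            start = max(finish[j], ready)
            done = start + send_fixed + send_per_token * (i - j)
            if done < finish[i]:
                finish[i] = done
                prev[i] = j

    # Walk the back-pointers to recover the batch sizes in sending order.
    sizes: List[int] = []
    i = n_tokens
    while i > 0:
        sizes.append(i - prev[i])
        i = prev[i]
    sizes.reverse()
    return finish[n_tokens], sizes


def sequential_time(
    n_tokens: int, gen_time: float, send_fixed: float, send_per_token: float
) -> float:
    """Baseline with no overlap: draft every token, then send them in one transfer."""
    return n_tokens * gen_time + send_fixed + send_per_token * n_tokens


if __name__ == "__main__":
    total, batches = schedule_batches(8, 1.0, 2.0, 0.5)
    print("pipelined:", total, "batches:", batches)
    print("sequential:", sequential_time(8, 1.0, 2.0, 0.5))
```

Under this toy model the pipelined schedule can never be slower than the sequential baseline, because sending everything in a single batch is one of the candidates the dynamic program considers. How large the gain is depends on the ratio of fixed to per-token transfer cost.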

Implemented using llama-cpp-python, PyTorch, and FastAPI, PipeSD was evaluated on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. It consistently outperformed state-of-the-art baselines, delivering a 1.16x–2.16x speedup and a 14.3%–25.3% reduction in energy consumption. The work, accepted at ICML 2026, demonstrates practical gains for deploying large language models across cloud and edge devices.

Key Points
  • 1.16x–2.16x speedup over state-of-the-art baselines in cloud-edge LLM inference
  • 14.3%–25.3% reduction in energy consumption across four test scenarios
  • Novel token-batch pipeline scheduling (dynamic programming) and dual-threshold NAV triggering (Bayesian autotuner)
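
The summary does not say what the two thresholds in the NAV trigger are compared against, so the following Python sketch shows only one plausible reading, not PipeSD's rule. It assumes each draft token carries a confidence score: a score below the low threshold triggers verification at once (to limit rollback cost), a score at or above the high threshold lets drafting continue (to avoid premature verification), and scores in between use up a small budget. The function name, the thresholds' default values, and the max_uncertain budget are all hypothetical.

```python
from typing import Sequence


def draft_length_before_verify(
    confidences: Sequence[float],
    low: float = 0.3,        # assumed: below this, verify immediately
    high: float = 0.8,       # assumed: at or above this, keep drafting
    max_uncertain: int = 2,  # assumed: budget of in-between tokens
) -> int:
    """Return how many draft tokens are produced before verification is triggered."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("expected 0 <= low <= high <= 1")

    uncertain = 0
    for count, p in enumerate(confidences, start=1):
        if p < low:
            # Likely to be rejected: verifying now limits the rollback.
            return count
        if p < high:
            # Neither clearly good nor clearly bad: spend one unit of budget.
            uncertain += 1
            if uncertain >= max_uncertain:
                return count
        # p >= high: confident token, continue drafting.
    # No trigger fired: verify whatever was drafted.
    return len(confidences)


if __name__ == "__main__":
    print(draft_length_before_verify([0.9, 0.95, 0.2, 0.9]))
    print(draft_length_before_verify([0.9, 0.5, 0.9, 0.6, 0.9]))
```

In a setup like this, the threshold values are the kind of parameters an autotuner could search over, though the summary does not state which parameters PipeSD's Bayesian optimization autotuner actually adjusts.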

Why It Matters

PipeSD makes cloud-edge LLM deployment faster and more energy-efficient, which matters for real-time applications and for scaling edge AI.