Research & Papers

Scepsy: Serving Agentic Workflows Using Aggregate LLM Pipelines

New system from Imperial College London and Huawei researchers cuts agentic workflow latency by up to 27x using aggregate LLM pipelines.

Deep Dive

A research team from Imperial College London and Huawei has unveiled Scepsy, a novel system designed to efficiently serve complex AI agentic workflows on GPU clusters. These workflows, which chain together multiple large language models (LLMs) and tools to complete tasks, present a major serving challenge. Their execution is unpredictable, often branching or fanning out, and they typically require more LLM instances than there are available GPUs, leading to oversubscription and poor performance.
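
To make the serving problem concrete, here is a minimal sketch of how such a workflow might be represented as a graph of LLM stages. Everything here (the LLMStage class, the model labels, the toy topology) is an illustrative assumption, not Scepsy's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class LLMStage:
    """One LLM (or tool) invocation in an agentic workflow."""
    name: str
    model: str                                       # e.g. a small planner vs. a large solver
    successors: list = field(default_factory=list)   # downstream stages; >1 means fan-out

# A toy workflow: a planner fans out to two workers, which feed a summarizer.
summarize = LLMStage("summarize", "llama-70b")
search    = LLMStage("search-agent", "llama-7b", successors=[summarize])
code      = LLMStage("code-agent", "llama-13b", successors=[summarize])
plan      = LLMStage("plan", "llama-7b", successors=[search, code])

# Four distinct LLM instances but, say, only two GPUs: the cluster is
# oversubscribed, so instances must share or time-slice the hardware.
```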

Scepsy tackles this by exploiting a key insight: while an agentic workflow's total execution time is variable, the relative share of time each constituent LLM consumes remains stable. The system first profiles LLMs under different parallelism configurations. It then uses this data to construct an 'Aggregate LLM Pipeline'—a lightweight model that predicts the latency and throughput of any potential GPU allocation strategy.
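
A rough sketch of what such a predictor could look like, assuming profiled per-stage latency/throughput tables and stable per-stage work shares. The function name, the additive latency model, and the bottleneck throughput model are simplifying assumptions for illustration; the paper's actual aggregate pipeline model may differ.

```python
def predict_pipeline(allocation, profiles, shares):
    """Estimate end-to-end latency and throughput for one candidate allocation.

    allocation: {stage: (gpu_fraction, tp_degree, replicas)}
    profiles:   {stage: {(gpu_fraction, tp_degree): (latency_s, tput_req_per_s)}}
    shares:     {stage: expected invocations per workflow run}, derived from
                the profiled, stable per-stage time shares.
    """
    latency = 0.0
    throughput = float("inf")
    for stage, (frac, tp, replicas) in allocation.items():
        stage_lat, stage_tput = profiles[stage][(frac, tp)]
        # Latency accumulates along the workflow in proportion to how
        # often each stage is invoked per run.
        latency += shares[stage] * stage_lat
        # Pipeline throughput is capped by its slowest stage; replicas
        # serve requests in parallel, scaling that stage's capacity.
        throughput = min(throughput, replicas * stage_tput / shares[stage])
    return latency, throughput
```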

Using this predictor, Scepsy performs a heuristic search over a vast space of possible allocations, including fractional GPU shares, tensor parallelism degrees, and replica counts. Its goal is to find a configuration that minimizes end-to-end latency while hitting a target throughput. Finally, it maps the chosen allocation onto the physical GPU cluster, minimizing memory fragmentation and respecting network topology constraints.
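
The toy search below illustrates the objective rather than Scepsy's actual heuristic: it exhaustively enumerates per-stage configurations, discards those that exceed a hypothetical GPU budget or miss the throughput target (using predict_pipeline from the sketch above), and keeps the lowest-latency survivor. The candidate options and the GPU accounting are assumptions for illustration.

```python
import itertools

def search_allocations(stages, profiles, shares, gpu_budget, target_tput):
    """Exhaustive toy search: minimize predicted latency subject to a
    throughput target and a GPU budget."""
    # Candidate (gpu_fraction, tp_degree, replicas) options per stage.
    options = [(f, tp, r) for f in (0.25, 0.5, 1.0)
                          for tp in (1, 2, 4)
                          for r in (1, 2)]
    best = None
    for combo in itertools.product(options, repeat=len(stages)):
        allocation = dict(zip(stages, combo))
        # Each replica runs tp shards, each using a fraction of a GPU.
        gpus_used = sum(f * tp * r for f, tp, r in combo)
        if gpus_used > gpu_budget:
            continue
        latency, tput = predict_pipeline(allocation, profiles, shares)
        if tput < target_tput:
            continue
        if best is None or latency < best[0]:
            best = (latency, allocation)
    return best
```

Even this toy version shows why a heuristic is needed: the candidate space grows exponentially with the number of stages, so an exhaustive sweep quickly becomes infeasible.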

In evaluations on realistic agentic workflows, Scepsy dramatically outperformed existing approaches, achieving up to 2.4 times higher throughput and up to 27 times lower latency than systems that optimize each LLM independently or rely on manual, user-specified allocations. This represents a significant leap in making complex, multi-step AI agents practical for real-time, production-scale applications.

Key Points
  • Uses Aggregate LLM Pipelines to predict performance, enabling joint selection of fractional GPU shares, tensor parallelism degrees, and replica counts.
  • Achieves up to 2.4x higher throughput and 27x lower latency vs. standard LLM-serving methods.
  • Solves GPU oversubscription for agentic workflows, where LLM instances often outnumber available GPUs.

Why It Matters

Enables scalable, low-latency deployment of complex AI agents, moving them from research prototypes to reliable production services.