Agent Frameworks

MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems

A new co-scheduler for LLM agents that slashes end-to-end latency by nearly 6x.

Deep Dive

A team of researchers has published a paper on arXiv introducing MARS, an efficient, adaptive co-scheduling system designed for heterogeneous agentic workloads. Unlike traditional LLM inference that handles single-turn text generation, agentic systems involve multi-turn LLM-to-tool loops, shifting resource demands from pure GPU execution to a mix of GPU and CPU tasks. MARS addresses this by establishing a unified information stream that provides holistic visibility across both GPU inference and CPU tool execution. An external control plane decouples admission from execution to prevent resource oversubscription, while an internal agent-centric scheduler minimizes the end-to-end critical path by prioritizing latency-sensitive continuations and adaptively retaining KV cache only when warm resumption offers a latency benefit.
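To make the admission/execution split concrete, here is a minimal Python sketch of an external control plane that gates new LLM (GPU) and tool (CPU) turns on separate resource budgets so executors are never oversubscribed. The names and structure (ControlPlane, AgentTurn, est_cost) are illustrative assumptions for this article, not MARS's actual interfaces.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AgentTurn:
    agent_id: str
    kind: str        # "llm" -> GPU inference, "tool" -> CPU execution
    est_cost: float  # estimated resource units consumed while the turn runs


class ControlPlane:
    """Hypothetical admission controller: admits a turn only while the
    relevant budget has headroom; execution drains the queue separately."""

    def __init__(self, gpu_budget: float, cpu_budget: float):
        self.budgets = {"llm": gpu_budget, "tool": cpu_budget}
        self.in_flight = {"llm": 0.0, "tool": 0.0}
        self.admitted = deque()

    def try_admit(self, turn: AgentTurn) -> bool:
        # Admission is decided here, independent of when the execution
        # engine actually dequeues and runs the turn.
        if self.in_flight[turn.kind] + turn.est_cost > self.budgets[turn.kind]:
            return False
        self.in_flight[turn.kind] += turn.est_cost
        self.admitted.append(turn)
        return True

    def release(self, turn: AgentTurn) -> None:
        # Called by the executor once the turn finishes, freeing budget.
        self.in_flight[turn.kind] -= turn.est_cost
```

Decoupling the admission decision from execution in this way keeps queued agent turns from piling onto already-saturated GPU or CPU resources, which is the oversubscription problem the paper describes.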

In evaluations, MARS reduces end-to-end latency by up to 5.94x while maintaining nearly maximal system throughput. The researchers also integrated MARS as the serving backend for the OpenHands coding agent framework, where it accelerated end-to-end task completion time by up to 1.87x, demonstrating real-world effectiveness. The 14-page paper, which includes 13 figures, spans operating systems, distributed computing, and multi-agent systems, and the authors say the source code will be made publicly available. This work highlights a critical system-level challenge as LLM-based agents become more prevalent, offering a practical solution for optimizing resource coordination in production environments.

Key Points
  • MARS reduces end-to-end latency for agentic AI workloads by up to 5.94x while maintaining near-maximum throughput.
  • It uses a unified information stream across GPU and CPU, plus an agent-centric scheduler that prioritizes latency-sensitive continuations and adaptively retains KV cache (see the sketch after this list).
  • Real-world testing with the OpenHands coding agent showed up to a 1.87x speedup in task completion time.
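The adaptive retention idea can be pictured as a simple cost comparison: keep an agent's KV cache across a tool call only when warm resumption is expected to beat recomputing the prefill, and only when memory can be spared. The Python sketch below is an assumed heuristic for that trade-off; the thresholds and estimates are illustrative, not values or policies from the paper.

```python
def should_retain_kv_cache(
    expected_tool_time_s: float,   # predicted duration of the CPU tool call
    prefill_recompute_s: float,    # time to rebuild this prompt's KV cache
    cache_bytes: int,              # size of the agent's KV cache
    free_kv_bytes: int,            # free KV-cache memory on the GPU
) -> bool:
    """Assumed decision rule: retain the cache across a tool call only when
    warm resumption saves latency and memory pressure allows it."""
    # Evict when holding the cache would crowd out other agents' decoding.
    if cache_bytes > 0.5 * free_kv_bytes:  # assumed pressure heuristic
        return False
    # Retain only when rebuilding the prefill would cost more than the
    # (assumed) opportunity cost of holding memory during the tool call.
    holding_penalty_s = 0.05 * expected_tool_time_s
    return prefill_recompute_s > holding_penalty_s
```

A rule like this captures the paper's stated intent, retaining KV cache only when warm resumption offers a latency benefit, while freeing GPU memory during long-running tool calls.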

Why It Matters

As LLM agents move from single-turn inference to multi-tool loops, MARS provides a system-level solution to prevent GPU-CPU bottlenecks and latency spikes.