Introducing Disaggregated Inference on AWS powered by llm-d
New container separates prefill and decode phases across GPUs, boosting utilization for agentic AI.
AWS has partnered with the open-source llm-d team to launch a new container (ghcr.io/llm-d/llm-d-aws) that brings disaggregated inference to its cloud platform. The solution directly tackles an inefficiency in traditional LLM serving: the compute-intensive 'prefill' phase and the memory-intensive 'decode' phase compete for the same GPU resources, leading to poor utilization of both. By architecturally separating these phases across a distributed pool of GPUs connected via AWS's high-speed Elastic Fabric Adapter (EFA), the system can schedule each phase onto hardware sized for its own bottleneck. This matters for the new era of agentic and reasoning AI, where multi-step workflows can generate an order of magnitude more tokens and place highly variable, unpredictable demands on inference infrastructure.
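To make the phase split concrete, here is a deliberately minimal Python sketch of a two-pool design. The class names, fields, and handoff mechanism are hypothetical illustrations of the concept, not llm-d's actual interfaces.

```python
# Toy sketch of disaggregated inference -- NOT llm-d's API.
# Prefill (compute-bound) and decode (memory-bound) run on separate
# worker pools, with the KV cache handed off between them.

from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: int    # drives prefill compute cost
    max_new_tokens: int   # drives decode memory residency
    kv_cache: list = field(default_factory=list)


class PrefillWorker:
    """Compute-optimized pool: processes the whole prompt in one pass."""

    def run(self, req: Request) -> Request:
        # Stand-in for attention over the prompt; a real system builds the
        # KV cache here and ships it to a decode worker over the network.
        req.kv_cache = [f"kv_{i}" for i in range(req.prompt_tokens)]
        return req


class DecodeWorker:
    """Memory-optimized pool: generates tokens one at a time from the cache."""

    def run(self, req: Request) -> list[str]:
        assert req.kv_cache, "decode requires a populated KV cache"
        return [f"tok_{i}" for i in range(req.max_new_tokens)]


def serve(req: Request, prefill: PrefillWorker, decode: DecodeWorker) -> list[str]:
    # Scheduling the two phases independently is what lets each pool be
    # provisioned for its own bottleneck (FLOPs vs. HBM capacity).
    return decode.run(prefill.run(req))


if __name__ == "__main__":
    out = serve(Request(prompt_tokens=512, max_new_tokens=4),
                PrefillWorker(), DecodeWorker())
    print(out)  # ['tok_0', 'tok_1', 'tok_2', 'tok_3']
```

In a real cluster, that KV-cache handoff between pools is exactly the kind of cross-node transfer a high-speed interconnect like EFA is meant to accelerate.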
The launch is the result of months of collaboration, integrating llm-d's Kubernetes-native framework with AWS's networking stack: the Elastic Fabric Adapter (EFA) and the libfabric library that drives it. The framework builds on the popular vLLM engine, extending it with production-grade orchestration and advanced scheduling for multi-node serving. It also introduces 'well-lit paths', reference architectures that package optimization strategies for different performance goals. Available on Amazon SageMaker HyperPod and Amazon Elastic Kubernetes Service (EKS), the solution lets enterprises deploy large-scale AI applications with significantly higher GPU utilization, lower latency, and reduced operational cost, moving from prototype to efficient production deployment.
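Because llm-d builds on vLLM, a deployed endpoint speaks vLLM's OpenAI-compatible API, so a standard client works against it. The snippet below is a sketch under that assumption; the gateway URL, API key handling, and model id are illustrative placeholders, not values from the launch.

```python
# Hypothetical client call against a deployed llm-d endpoint.
# vLLM-based servers expose an OpenAI-compatible API, so the
# standard OpenAI Python client can be pointed at the gateway.
from openai import OpenAI

client = OpenAI(
    base_url="http://my-llm-d-gateway.example.com/v1",  # assumed gateway address
    api_key="EMPTY",  # vLLM-style servers typically don't require a real key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize disaggregated inference."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```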
- Architecturally separates LLM inference into distinct prefill (compute-bound) and decode (memory-bound) phases served by separate GPU pools.
- Uses AWS's Elastic Fabric Adapter (EFA) for high-speed interconnect in multi-node, Kubernetes-native deployments.
- Targets agentic AI workloads, which can generate 10x more tokens and create highly variable inference demands; a toy scaling sketch follows this list.
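Disaggregation also means the two pools can autoscale on different signals, which is how the design absorbs the variable demand the bullets describe. Below is a toy target-utilization heuristic with made-up thresholds; it illustrates the idea and is not llm-d's scheduler.

```python
# Toy autoscaling heuristic -- purely illustrative, not llm-d's scheduler.
# Because prefill and decode are separate pools, each can scale on its own
# signal: prefill on queued prompt tokens (compute), decode on in-flight
# sequences holding KV cache (memory). All thresholds are assumptions.
import math


def desired_replicas(queued_units: float, units_per_replica: float,
                     min_replicas: int = 1) -> int:
    """Classic target-utilization scaling: ceil(load / per-replica capacity)."""
    return max(min_replicas, math.ceil(queued_units / units_per_replica))


# Spiky agentic traffic: long tool-use prompts arrive in bursts, while
# decode load grows with the number of concurrent generations.
prefill_replicas = desired_replicas(queued_units=2_000_000,  # prompt tokens queued
                                    units_per_replica=250_000)
decode_replicas = desired_replicas(queued_units=384,         # active sequences
                                   units_per_replica=64)
print(prefill_replicas, decode_replicas)  # 8 6 -- the pools scale independently
```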
Why It Matters
Enables cost-effective, large-scale deployment of complex AI agents by maximizing utilization of expensive GPUs.