Developer Tools

Capacity-aware inference: Automatic instance fallback for SageMaker AI endpoints

No more manual retries: prioritized instance pools keep endpoints running on available hardware.

Deep Dive

Amazon SageMaker AI today announced capacity-aware instance pools, a new feature that automatically falls back to alternative GPU instance types when the preferred hardware is unavailable. Previously, deploying a real-time inference endpoint required committing to a single instance type at creation time. If that type lacked capacity, the endpoint would fail with an InsufficientCapacity error, forcing users to manually iterate through alternatives, with each failed attempt costing additional deployment time. The same problem occurred during auto scaling scale-out events, where the autoscaler would retry the same constrained type indefinitely, leaving traffic unserved. The new feature solves this by letting users define a ranked list of instance types in the endpoint configuration. SageMaker AI walks the list automatically: it tries the first choice, then the second, third, and so on until it finds available capacity. Endpoints reach InService in minutes without human intervention.
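In practice, the ranked list is declared in the endpoint configuration. A minimal boto3 sketch follows; the InstanceTypePriorityList field name, along with the endpoint, model, and instance-type values, are illustrative assumptions rather than the confirmed API shape, so consult the SageMaker AI API reference for the actual parameters.

    import boto3

    sm = boto3.client("sagemaker")

    # Endpoint config with a ranked capacity pool. The field name
    # "InstanceTypePriorityList" is hypothetical; check the SageMaker AI
    # API reference for the actual parameter.
    sm.create_endpoint_config(
        EndpointConfigName="llm-endpoint-config",
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": "my-llm-model",
                "InitialInstanceCount": 2,
                "InstanceType": "ml.p4d.24xlarge",   # preferred type
                "InstanceTypePriorityList": [        # hypothetical field
                    "ml.p4d.24xlarge",   # first choice
                    "ml.g5.48xlarge",    # fallback 1
                    "ml.g5.12xlarge",    # fallback 2
                ],
            }
        ],
    )

    # Creation proceeds even if ml.p4d.24xlarge is capacity-constrained:
    # SageMaker AI walks the list until a type with capacity is found.
    sm.create_endpoint(
        EndpointName="llm-endpoint",
        EndpointConfigName="llm-endpoint-config",
    )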

During scale-out, SageMaker AI provisions the next available type in the priority list, ensuring traffic continues flowing even when the preferred hardware is constrained. Scale-in is similarly intelligent: instances from the lowest-priority (fallback) types are removed first, so the fleet naturally trends back toward preferred hardware over time as capacity frees up. Observability also improves: every existing CloudWatch metric now includes an InstanceType dimension, enabling users to track latency, throughput, GPU utilization, and instance count per type within a single endpoint. The feature works for single-model, inference component-based, and asynchronous endpoints. For models that require specific GPU memory or compute, users can either bring their own optimized models for each fallback type or rely on SageMaker's built-in optimizations. The update significantly reduces operational overhead for organizations running large-scale generative AI workloads.
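Per-type metrics are available through the standard CloudWatch APIs. The sketch below reads GPU utilization for a single fallback type; the endpoint, variant, and instance-type values are placeholders, and the exact dimension set for the new per-type metrics should be confirmed against the SageMaker AI documentation.

    import boto3
    from datetime import datetime, timedelta, timezone

    cw = boto3.client("cloudwatch")

    # Average GPU utilization over the last hour for one instance type.
    # "/aws/sagemaker/Endpoints" is the existing namespace for endpoint
    # container metrics; the InstanceType dimension is the new addition.
    resp = cw.get_metric_statistics(
        Namespace="/aws/sagemaker/Endpoints",
        MetricName="GPUUtilization",
        Dimensions=[
            {"Name": "EndpointName", "Value": "llm-endpoint"},
            {"Name": "VariantName", "Value": "AllTraffic"},
            {"Name": "InstanceType", "Value": "ml.g5.48xlarge"},  # fallback
        ],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
        EndTime=datetime.now(timezone.utc),
        Period=300,             # 5-minute buckets
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1))

Comparing the same metric across each type in the pool shows how much of the endpoint's traffic is being served by fallback hardware at any given time.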

Key Points
  • Define a prioritized list of instance types; SageMaker AI automatically tries each in order during endpoint creation and scale-out until it finds available capacity.
  • During scale-in, lowest-priority instances are removed first, gradually shifting the fleet back to preferred hardware without manual action.
  • CloudWatch metrics now include an InstanceType dimension, enabling per-type analysis of latency, throughput, and GPU utilization.

Why It Matters

Eliminates manual GPU capacity management for LLM endpoints, reducing downtime and operational overhead at scale.