Deploy SageMaker AI inference endpoints with reserved GPU capacity using training plans
Amazon repurposes training plans to guarantee p-family GPU access for time-sensitive LLM deployments.
AWS has announced a significant expansion of its SageMaker AI training plans, repurposing the reservation system to support inference endpoints. Originally designed to secure GPU capacity for model training workloads, the feature now lets data science and ML engineering teams reserve specific p-family GPU instance types, such as ml.p5.48xlarge, for a predetermined time window. This directly tackles a major pain point: deploying large language models (LLMs) for inference during critical periods such as evaluation sprints, limited-duration testing, or burst workloads, when on-demand capacity in a given AWS Region can be unreliable.
The solution involves a four-phase workflow. First, users identify their capacity requirements: instance type, instance count, and duration. Second, they search for available training plan offerings using a dedicated API, specifying `target-resources` as "endpoint" to filter for inference-ready capacity. Third, they select an offering and create a reservation, which generates an Amazon Resource Name (ARN). Finally, they deploy a SageMaker inference endpoint configured to use that reserved ARN. This guarantees the required GPU resources are available for the entire reservation period, providing cost control and predictable availability for time-bound inference tasks.
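A minimal sketch of the first three phases using boto3 follows. The instance type, duration, and plan name are illustrative, and the `TargetResources` value `"endpoint"` is taken from the workflow described above; verify the accepted values for your Region against the current SageMaker API reference.

```python
import boto3

sm = boto3.client("sagemaker")

# Phases 1-2: search for offerings matching the capacity requirement
# (one ml.p5.48xlarge for roughly a week) that endpoints can consume.
offerings = sm.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",
    InstanceCount=1,
    DurationHours=168,             # illustrative one-week window
    TargetResources=["endpoint"],  # filter for inference-ready capacity
)["TrainingPlanOfferings"]

# Phase 3: reserve the first matching offering (assumes at least one was
# returned). The ARN in the response is what the endpoint configuration
# will reference in the final phase.
plan = sm.create_training_plan(
    TrainingPlanName="llm-eval-sprint-plan",  # illustrative name
    TrainingPlanOfferingId=offerings[0]["TrainingPlanOfferingId"],
)
print("Reserved capacity ARN:", plan["TrainingPlanArn"])
```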
- Training plans, previously for model training, now support inference endpoints for reserved GPU capacity.
- Users can search for and reserve specific p-family instances (e.g., ml.p5.48xlarge) for set durations using a dedicated API.
- The reserved capacity is referenced via an ARN in the endpoint configuration, ensuring deployment on the guaranteed instances (see the sketch after this list).
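As a sketch of the final phase, the endpoint configuration below pins a production variant to the reserved capacity. The `CapacityReservationConfig` and `MlReservationArn` field names, the preference value, and the pre-existing model name are assumptions inferred from the workflow described above, not confirmed API details; check the SageMaker API reference before use.

```python
import boto3

sm = boto3.client("sagemaker")

# Assumes a SageMaker model named "llm-eval-model" already exists, and that
# plan_arn holds the ARN returned by create_training_plan (placeholder
# account ID and Region shown here).
plan_arn = "arn:aws:sagemaker:us-east-1:111122223333:training-plan/llm-eval-sprint-plan"

sm.create_endpoint_config(
    EndpointConfigName="llm-eval-endpoint-config",
    ProductionVariants=[{
        "VariantName": "primary",
        "ModelName": "llm-eval-model",     # hypothetical existing model
        "InstanceType": "ml.p5.48xlarge",  # must match the reservation
        "InitialInstanceCount": 1,
        # Assumed field names: tie this variant to the reserved capacity.
        "CapacityReservationConfig": {
            "CapacityReservationPreference": "capacity-reservations-only",
            "MlReservationArn": plan_arn,
        },
    }],
)

# Deploy the endpoint onto the guaranteed instances.
sm.create_endpoint(
    EndpointName="llm-eval-endpoint",
    EndpointConfigName="llm-eval-endpoint-config",
)
```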
Why It Matters
Eliminates deployment delays for critical AI evaluation and production workloads by guaranteeing GPU access, improving project timelines and cost predictability.