Developer Tools

Efficiently serve dozens of fine-tuned models with vLLM on Amazon SageMaker AI and Amazon Bedrock

Share one GPU across dozens of custom AI models with 19% faster token generation and 8% lower latency.

Deep Dive

AWS has announced a significant optimization for hosting multiple fine-tuned AI models, developed in collaboration with the open-source vLLM community. The new Multi-LoRA (Low-Rank Adaptation) serving capability targets the costly, underutilized GPU capacity that results when organizations deploy numerous custom models, particularly for emerging Mixture of Experts (MoE) model families like GPT-OSS and Qwen. Instead of dedicating a GPU to a single, sporadically used model, this solution allows dozens of models to share the same hardware by keeping the base model weights frozen and dynamically loading only the small, trained LoRA adapters for each inference request. In effect, five underutilized GPUs become one efficiently shared resource, dramatically improving cost efficiency for custom AI deployments on Amazon SageMaker AI and Amazon Bedrock.
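The core idea above can be shown in a few lines of numpy. This is a minimal illustrative sketch, not vLLM's actual fused-kernel implementation: one frozen weight matrix `W` is shared by all tenants, and each "model" is just a small low-rank pair `(A, B)` that gets applied per request. The adapter names and dimensions are invented for the example.

```python
import numpy as np

# One frozen base weight matrix, shared by every fine-tuned variant.
rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4            # rank << d_in, d_out

W = rng.standard_normal((d_out, d_in))   # base weights: never modified

def make_adapter(seed, alpha=8.0):
    """A LoRA adapter: effective weights are W + (alpha / rank) * B @ A."""
    r = np.random.default_rng(seed)
    A = r.standard_normal((rank, d_in)) * 0.01   # (rank, d_in)
    B = r.standard_normal((d_out, rank)) * 0.01  # (d_out, rank)
    return A, B, alpha

# Dozens of these fit in memory because each is tiny compared to W.
adapters = {"support-bot": make_adapter(1), "legal-bot": make_adapter(2)}

def forward(x, adapter_name):
    """One linear layer, swapping in the requested adapter on the fly."""
    A, B, alpha = adapters[adapter_name]
    # Base projection plus the low-rank correction; W itself stays frozen.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y1 = forward(x, "support-bot")
y2 = forward(x, "legal-bot")
```

Per request, only `A` and `B` change, which is why serving many adapters is far cheaper than serving many full model copies; vLLM batches requests for different adapters together against the same frozen `W`.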

The technical breakthrough involves kernel-level optimizations within vLLM to efficiently handle the fused operations required for MoE architectures. For a model like GPT-OSS 20B, AWS-specific enhancements deliver a 19% increase in Output Tokens Per Second (OTPS) and an 8% reduction in Time To First Token (TTFT) compared to the standard vLLM 0.15.0 release. While built for sparse MoE models, the improvements also benefit dense models like Llama 3.3 70B. This directly addresses a major pain point in enterprise AI: the prohibitive cost of serving many specialized, low-traffic models. It empowers companies to deploy a fleet of fine-tuned models for different tasks or departments without multiplying their infrastructure spend, making customized AI more accessible and scalable.

Key Points
  • Enables multiple fine-tuned models (e.g., GPT-OSS 20B, Qwen3-MoE) to share a single GPU via dynamic LoRA adapter swapping, eliminating idle capacity.
  • Delivers 19% higher Output Tokens Per Second and 8% lower Time To First Token for GPT-OSS 20B with AWS-specific optimizations in vLLM.
  • Available now in vLLM 0.15.0+ for local deployments and optimized for hosting on Amazon SageMaker AI and Amazon Bedrock.
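For local deployments, vLLM exposes multi-LoRA serving through its OpenAI-compatible server. The sketch below uses vLLM's `--enable-lora` and `--lora-modules` options; the adapter names, paths, and limits are placeholders, so check the docs of your installed vLLM version before copying.

```shell
# Serve one shared base model with several named LoRA adapters.
# Adapter names and paths are illustrative placeholders.
vllm serve openai/gpt-oss-20b \
  --enable-lora \
  --max-loras 8 \
  --lora-modules support-bot=/adapters/support legal-bot=/adapters/legal

# Each request selects its adapter via the OpenAI-compatible "model" field;
# vLLM applies that adapter's low-rank weights for this request only.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "support-bot", "prompt": "Summarize my ticket:", "max_tokens": 64}'
```

Because the base weights stay resident on the GPU, adding another fine-tuned variant is a matter of registering another adapter name rather than provisioning another endpoint.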

Why It Matters

Drastically reduces the cost and complexity of serving dozens of specialized AI models, making custom enterprise AI deployments financially viable.