Accelerate Generative AI Inference on Amazon SageMaker AI with G7e Instances
New SageMaker instances double per-GPU memory to 96GB, enabling single-node deployment of 300B parameter models.
AWS has launched its next-generation G7e GPU instances for Amazon SageMaker AI, powered by NVIDIA's new RTX PRO 6000 Blackwell Server Edition GPUs. Each GPU delivers 96GB of GDDR7 memory—double the capacity of previous G6e instances—with memory bandwidth reaching 1,597 GB/s per GPU. The instances scale from single-GPU configurations (G7e.2xlarge) up to eight-GPU nodes (G7e.48xlarge), with the largest offering 768GB of aggregate GPU memory and 1,600 Gbps of networking throughput via Elastic Fabric Adapter (EFA). This represents a 4x improvement in network bandwidth over G6e instances.
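For context, here is a minimal sketch of what deploying a model to one of these instances could look like with the SageMaker Python SDK. The `ml.g7e.48xlarge` instance type string follows SageMaker's usual naming convention but is an assumption here, as are the container and model choices; confirm availability in your region before use.

```python
# Minimal sketch: deploying an LLM to a SageMaker real-time endpoint on G7e.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Resolve a Hugging Face TGI (LLM serving) container image for the session's region.
image_uri = get_huggingface_llm_image_uri("huggingface")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-70B-Instruct",  # example model;
        # gated models also require a HUGGING_FACE_HUB_TOKEN entry here
        "SM_NUM_GPUS": "8",  # shard the model across all eight GPUs on the node
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g7e.48xlarge",  # assumed name for the 8-GPU G7e node
)

print(predictor.predict({"inputs": "Hello"}))
```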
These specifications enable significant gains for generative AI inference, with AWS claiming up to 2.3x faster performance than G6e instances. The increased memory density lets developers host larger models on fewer nodes: a single GPU can now serve models of up to 35B parameters, while an 8-GPU node can accommodate models of up to 300B parameters. This reduces the need for complex multi-node setups, lowering both operational overhead and inter-node latency. The instances also support FP4 precision via NVIDIA's fifth-generation Tensor Cores, making them suitable for deploying large language models, multimodal AI, and agentic workflows.
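A back-of-envelope calculation (not from the AWS announcement) shows how those model-size figures line up with the memory specifications, assuming weights quantized to FP8 or FP4 and leaving headroom for the KV cache, activations, and runtime overhead:

```python
# Rough weight-memory footprint: parameters x bytes per parameter.
def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight footprint in GB for params_b billion parameters
    stored at the given precision (bits per parameter)."""
    return params_b * 1e9 * bits / 8 / 1e9

# A 35B model at FP8 needs ~35GB of weights, fitting on one 96GB GPU
# with room to spare for KV cache and activations.
print(f"35B @ FP8:  {weight_gb(35, 8):.0f} GB (vs 96 GB per GPU)")

# A 300B model needs ~300GB at FP8 and ~150GB at FP4, well within the
# 768GB of aggregate memory on an 8-GPU G7e.48xlarge node.
print(f"300B @ FP8: {weight_gb(300, 8):.0f} GB (vs 768 GB per node)")
print(f"300B @ FP4: {weight_gb(300, 4):.0f} GB (vs 768 GB per node)")
```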
Key use cases include chatbots with improved response times, Retrieval Augmented Generation (RAG) pipelines benefiting from faster context injection, and long-context inference where large KV caches are essential. The doubled memory also alleviates out-of-memory failures when serving larger vision models and extends the instances' reach to physical AI and scientific computing workloads. By providing more cost-effective and powerful infrastructure, AWS aims to help organizations scale their generative AI deployments while maintaining high performance for demanding inference workloads.
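To see why the extra memory matters for long contexts, a rough KV-cache estimate helps. The model dimensions below are illustrative, Llama-3.1-70B-like values (an assumption, not figures from AWS):

```python
# Per generated token, a transformer caches one key and one value vector
# per layer per KV head, so the cache grows linearly with context length.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV-cache footprint in GB: 2 tensors (K and V) per layer, FP16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# Example: 80 layers, 8 KV heads, head_dim 128, serving 4 concurrent
# requests at a 128K-token context in FP16.
print(f"{kv_cache_gb(80, 8, 128, 131_072, 4):.0f} GB of KV cache")
# ~172GB: more than a single 96GB GPU, but comfortable alongside
# quantized weights within a multi-GPU G7e node's 768GB of memory.
```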
- Each NVIDIA RTX PRO 6000 GPU provides 96GB GDDR7 memory, doubling previous generation capacity
- Delivers up to 2.3x faster inference performance than G6e instances, along with 1,600 Gbps of EFA network bandwidth
- Enables single-node deployment of 300B parameter models, reducing multi-node complexity and latency
Why It Matters
Lowers costs and simplifies deployment of large AI models, making advanced generative AI more accessible to enterprises.