Llama 3.1 405B (800GB checkpoint) loads in seconds instead of 10–20 minutes using sharded GDS on FSx for Lustre?

Llama 3.1 405B (800GB checkpoint) loads in seconds instead of 10–20 minutes using sharded GDS on FSx for Lustre.

GPUDirect Storage bypasses CPU and PCIe, letting all 8+ GPUs read their shards in parallel directly into HBM via EFA?

GPUDirect Storage bypasses CPU and PCIe, letting all 8+ GPUs read their shards in parallel directly into HBM via EFA.

TurboQuant KV cache quantizes key-value data to increase context window size without additional HBM overhead?

TurboQuant KV cache quantizes key-value data to increase context window size without additional HBM overhead.

Developer Tools

AWS and NVIDIA slash LLM loading from minutes to seconds with GPUDirect

AWS Machine Learning Blog June 02, 2026

⚡New technique cuts 405B model load time from 15 minutes to seconds on Blackwell instances.

Deep Dive

Amazon FSx for Lustre combined with NVIDIA GPUDirect Storage (GDS) transforms the cold-start experience for LLMs on AWS. Traditional CPU-based model loading streams checkpoints through system memory and copies weights to GPUs sequentially over PCIe, taking 10–20 minutes for a 405B parameter model like Llama 3.1. During that time, expensive GPUs sit idle. The new approach pre-splits the checkpoint across tensor-parallel ranks on FSx for Lustre, allowing all GPUs to read their shards in parallel directly into HBM through EFA, bypassing the CPU entirely. On the new Amazon EC2 P6e instances (Blackwell architecture with 72 GPUs per UltraServer), this reduces cold-start time to seconds, enabling faster autoscaling, fault recovery, and cost efficiency.

Beyond model loading, the recently announced TurboQuant KV cache dramatically increases context window size by quantizing the key-value cache. This allows larger sequences without sacrificing HBM capacity, a critical feature for long-context applications like document analysis and agentic workflows. Together, GDS and TurboQuant address two major bottlenecks in LLM inference: startup latency and memory-bound context length. The solution is particularly beneficial for large-scale deployments using UltraClusters, where each node independently loads from the shared filesystem in parallel, maximizing throughput.

Key Points

Llama 3.1 405B (800GB checkpoint) loads in seconds instead of 10–20 minutes using sharded GDS on FSx for Lustre.
GPUDirect Storage bypasses CPU and PCIe, letting all 8+ GPUs read their shards in parallel directly into HBM via EFA.
TurboQuant KV cache quantizes key-value data to increase context window size without additional HBM overhead.

Why It Matters

Faster model loading enables near-instant autoscaling and fault recovery, reducing GPU idle costs for large LLMs.

Read Original Article

AWS and NVIDIA slash LLM loading from minutes to seconds with GPUDirect

Why It Matters

Related Articles

🚀 Stay Ahead in AI