Research & Papers

Amazon's Surefire models boost LLM throughput by up to 47% with no accuracy loss

Amazon researchers design LLMs that deliver up to 47% higher inference throughput with no loss of accuracy.

Deep Dive

Amazon researchers Tao Yu and Youngsuk Park presented a new scaling law at ICLR that addresses the accuracy-efficiency tradeoff in large language models. DeepMind's 2022 Chinchilla scaling law balances model size against training data for a given compute budget, but it does not account for internal architecture choices. The Amazon team found that two models with the same parameter count and accuracy can differ by up to 40% in inference throughput, depending on hidden size and the ratio of MLP to attention parameters. Their scaling law incorporates these architectural factors directly, so practitioners can predict which configurations maximize speed without giving up accuracy. The resulting model family is called Surefire.
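For context, the Chinchilla paper's parametric loss can be sketched in a few lines. This is a minimal illustration of the baseline only, not the Amazon team's extended law, whose exact form is not given here. The constants are the fitted values reported by Hoffmann et al. (2022), quoted from memory, so treat them as approximate; the function name is made up for this example.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# N = parameter count, D = training tokens.
# Constants: fitted values reported by Hoffmann et al. (2022), quoted from
# memory and used here for illustration only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss from model size and data size alone."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# Predicted loss for a few model sizes at a fixed 1T-token budget.
for n_params in (1e9, 3e9, 8e9):
    print(f"{n_params:.0e} params: predicted loss {chinchilla_loss(n_params, 1e12):.3f}")
```

The function takes only parameter count and token count. Nothing in it distinguishes a wide, shallow model from a narrow, deep one, or an MLP-heavy layout from an attention-heavy one, which is the gap the new work targets.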

Surefire models match or exceed LLaMA-3.2 accuracy while improving throughput by up to 47%. The optimal MLP-to-attention ratio is around 1.0, far lower than the 4.8 seen in LLaMA-3.2-1B, indicating that many current models over-allocate parameters to MLP layers. These gains are consistent across both A100 and H200 GPUs and multiple serving frameworks. The work provides a practical framework for designing more efficient LLMs for real-time applications, bridging the gap between theoretical scaling laws and real-world deployment constraints.
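To make the MLP-to-attention ratio concrete, here is a rough sketch that counts per-layer weights for a LLaMA-style decoder block. It assumes the ratio means MLP weight count divided by attention weight count, with a gated three-matrix MLP and grouped-query attention; the paper's exact definition may differ. The class and function names are hypothetical, and the LLaMA-3.2-1B dimensions are taken from the public model config as best recalled.

```python
from dataclasses import dataclass


@dataclass
class DecoderConfig:
    hidden_size: int
    intermediate_size: int
    num_heads: int
    num_kv_heads: int
    head_dim: int


def attention_params(cfg: DecoderConfig) -> int:
    """Weights in the Q, K, V and output projections of one layer (no biases)."""
    q = cfg.hidden_size * cfg.num_heads * cfg.head_dim
    kv = 2 * cfg.hidden_size * cfg.num_kv_heads * cfg.head_dim
    out = cfg.num_heads * cfg.head_dim * cfg.hidden_size
    return q + kv + out


def mlp_params(cfg: DecoderConfig) -> int:
    """Weights in a gated MLP: gate, up and down projections (no biases)."""
    return 3 * cfg.hidden_size * cfg.intermediate_size


def mlp_to_attention_ratio(cfg: DecoderConfig) -> float:
    return mlp_params(cfg) / attention_params(cfg)


# Assumed LLaMA-3.2-1B dimensions (from the public config, as recalled).
LLAMA_3_2_1B = DecoderConfig(
    hidden_size=2048,
    intermediate_size=8192,
    num_heads=32,
    num_kv_heads=8,
    head_dim=64,
)

print(round(mlp_to_attention_ratio(LLAMA_3_2_1B), 2))  # 4.8
```

Under these assumptions the count reproduces the 4.8 figure quoted above. A ratio near 1.0 would mean roughly as many weights in attention as in the MLP.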

Key Points
  • Surefire models achieve up to 47% higher inference throughput without any loss of accuracy compared to LLaMA-3.2 baselines.
  • Optimal MLP-to-attention ratio is ~1.0, dramatically lower than the 4.8 ratio of LLaMA-3.2-1B, suggesting current models over-allocate parameters to MLP layers.
  • Consistent throughput gains observed across A100 and H200 GPUs and multiple serving frameworks, suggesting the result is not tied to one hardware or software stack.

Why It Matters

Enables faster, cheaper real-time AI apps without compromising quality, which is critical for production deployment.
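As a back-of-the-envelope check on the "cheaper" claim, a throughput gain can be converted into GPU time saved per token. The sketch below assumes cost scales linearly with GPU time and that the full 47% gain is realized, which is the best case reported; the function name is made up for this example.

```python
def gpu_time_saving(throughput_gain: float) -> float:
    """Fraction of GPU time saved per token for a given throughput gain.

    throughput_gain is a fraction, e.g. 0.47 for 47% more tokens per second.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


print(f"{gpu_time_saving(0.47):.1%}")  # 32.0%
```

So a 47% throughput gain corresponds to roughly a third less GPU time per generated token, not 47% less.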