nvidia/gpt-oss-puzzle-88B · Hugging Face
NVIDIA shrinks a 120B model to 88B using Puzzle NAS, achieving major throughput gains without sacrificing accuracy.
NVIDIA has unveiled GPT-OSS-Puzzle-88B, a new large language model engineered for superior inference efficiency on its own hardware. Derived from OpenAI's GPT-OSS-120B, the 88-billion-parameter model was created using NVIDIA's proprietary Puzzle framework, a post-training neural architecture search (NAS) system. The core achievement is a dramatic reduction in model size—down to roughly 73% of its parent—without compromising the reasoning accuracy that makes these models valuable. This optimization specifically targets the bottlenecks of modern AI inference: KV-cache bandwidth and memory capacity, which often limit performance more than raw compute power on GPUs like the H100.
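For readers who want to try the checkpoint, the sketch below shows one way to load it with Hugging Face Transformers. It is a minimal example under stated assumptions: that the nvidia/gpt-oss-puzzle-88B repository is publicly downloadable and works with the standard AutoModelForCausalLM path, and that enough GPU memory is available for automatic device placement; the prompt and generation settings are illustrative only.

```python
# Minimal loading sketch (assumes the nvidia/gpt-oss-puzzle-88B repo is public
# and compatible with the standard AutoModelForCausalLM path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/gpt-oss-puzzle-88B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard layers across whatever GPUs are visible
)

prompt = "Explain why KV-cache size limits long-context inference throughput."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```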
The performance gains are substantial and hardware-specific. On a single NVIDIA H100 GPU, the model achieves up to a 2.82x improvement in throughput. When scaled to an 8xH100 node, it shows a 1.63x throughput boost for long-context workloads (64K input/64K output) and a 1.22x improvement for short-context (4K/4K) scenarios. Architecturally, it is a Mixture-of-Experts (MoE) decoder-only transformer, featuring a modified version of the GPT-OSS architecture with a varying number of experts per layer and an adjusted global/window attention pattern. Combined with the long-context throughput gains, this design keeps the model well suited to the complex, multi-step reasoning tasks that are increasingly critical for enterprise AI applications.
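Because the headline numbers are serving-side throughput figures, a tensor-parallel deployment sketch may be more representative than single-prompt generation. The example below uses vLLM and should be read as an assumption-laden illustration: it presumes the checkpoint is supported by vLLM, and the tensor_parallel_size and max_model_len values stand in for an 8xH100 node and the 64K-in/64K-out workload rather than reproducing NVIDIA's benchmark configuration.

```python
# Hypothetical multi-GPU serving setup with vLLM (assumes the checkpoint is
# supported; the values below are illustrative, not NVIDIA's benchmark config).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/gpt-oss-puzzle-88B",
    tensor_parallel_size=8,    # spread weights and KV-cache across 8 GPUs
    max_model_len=131072,      # headroom for the 64K-input / 64K-output case
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(
    ["Walk through a multi-step plan for migrating a monolith to microservices."],
    params,
)
print(outputs[0].outputs[0].text)
```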
By demonstrating that a carefully pruned and architecturally searched model can outperform its larger predecessor in real-world serving metrics, NVIDIA is making a strong case for inference-optimized model design. This release is as much a showcase for the capabilities of the Puzzle NAS framework as it is for the model itself, highlighting a path forward where efficiency is engineered in, not just scaled out with more parameters.
- Uses NVIDIA's Puzzle NAS framework to shrink OpenAI's 120B model to 88B parameters (27% reduction).
- Delivers up to 2.82x higher throughput on a single H100 GPU and 1.63x improvement on 8-GPU nodes for long-context tasks.
- Maintains or slightly exceeds the accuracy of the larger parent model across reasoning benchmarks while specifically targeting KV-cache bandwidth and memory bottlenecks.
Why It Matters
This release demonstrates that major inference speedups are achievable without accuracy loss, directly lowering the cost and latency of running advanced AI models in production.