nvidia/gpt-oss-puzzle-88B · Hugging Face
NVIDIA shrinks a 120B model to 88B using Puzzle NAS, achieving major throughput gains without sacrificing accuracy.
NVIDIA has unveiled GPT-OSS-Puzzle-88B, a new large language model engineered for superior inference efficiency on its own hardware. Derived from OpenAI's GPT-OSS-120B, the 88-billion-parameter model was created using NVIDIA's proprietary Puzzle framework, a post-training neural architecture search (NAS) system. The core achievement is a dramatic reduction in model size—down to roughly 73% of its parent—without compromising the reasoning accuracy that makes these models valuable. This optimization specifically targets the bottlenecks of modern AI inference: KV-cache bandwidth and memory capacity, which often limit performance more than raw compute power on GPUs like the H100.
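For readers who want to try the checkpoint, the sketch below shows one way to load it with Hugging Face Transformers. It is a minimal example under stated assumptions: that the nvidia/gpt-oss-puzzle-88B repository is publicly downloadable and works with the standard AutoModelForCausalLM path, and that enough GPU memory is available for automatic device placement; the prompt and generation settings are illustrative only.

```python
# Minimal loading sketch (assumes the nvidia/gpt-oss-puzzle-88B repo is public
# and compatible with the standard AutoModelForCausalLM path).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/gpt-oss-puzzle-88B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard layers across whatever GPUs are visible
)

prompt = "Explain why KV-cache size limits long-context inference throughput."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```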
The performance gains are substantial and hardware-specific. On a single NVIDIA H100 GPU, the model achieves up to a 2.82x improvement in throughput. When scaled to an 8xH100 node, it shows a 1.63x throughput boost for long-context workloads (64K input/64K output) and a 1.22x improvement for short-context (4K/4K) scenarios. Architecturally, it is a Mixture-of-Experts (MoE) decoder-only transformer, featuring a modified version of the GPT-OSS architecture with a varying number of experts per layer and an adjusted global/window attention pattern. Combined with the long-context throughput gains, this design keeps the model well suited to the complex, multi-step reasoning tasks that are increasingly critical for enterprise AI applications.
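Because the headline numbers are serving-side throughput figures, a tensor-parallel deployment sketch may be more representative than single-prompt generation. The example below uses vLLM and should be read as an assumption-laden illustration: it presumes the checkpoint is supported by vLLM, and the tensor_parallel_size and max_model_len values stand in for an 8xH100 node and the 64K-in/64K-out workload rather than reproducing NVIDIA's benchmark configuration.

```python
# Hypothetical multi-GPU serving setup with vLLM (assumes the checkpoint is
# supported; the values below are illustrative, not NVIDIA's benchmark config).
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/gpt-oss-puzzle-88B",
    tensor_parallel_size=8,    # spread weights and KV-cache across 8 GPUs
    max_model_len=131072,      # headroom for the 64K-input / 64K-output case
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(
    ["Walk through a multi-step plan for migrating a monolith to microservices."],
    params,
)
print(outputs[0].outputs[0].text)
```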
By demonstrating that a carefully pruned and architecturally searched model can outperform its larger predecessor in real-world serving metrics, NVIDIA is making a strong case for inference-optimized model design. This release is as much a showcase for the capabilities of the Puzzle NAS framework as it is for the model itself, highlighting a path forward where efficiency is engineered in, not just scaled out with more parameters.
- Uses NVIDIA's Puzzle NAS framework to shrink OpenAI's 120B model to 88B parameters (27% reduction).
- Delivers up to 2.82x higher throughput on a single H100 GPU and 1.63x improvement on 8-GPU nodes for long-context tasks.
- Maintains or slightly exceeds the accuracy of the larger parent model across reasoning benchmarks while specifically targeting KV-cache bandwidth and memory bottlenecks.
Why It Matters
This release demonstrates that major inference speedups are achievable without accuracy loss, directly lowering the cost and latency of running advanced AI models in production.