Open Source

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing

Extract 23B and 12B models zero-shot from a single 30B checkpoint with shared KV cache.

Deep Dive

NVIDIA's Star Elastic introduces a novel inference strategy for large language models: a single checkpoint that contains three reasoning models (30B, 23B, and 12B) that can be extracted zero-shot. This is achieved through a post-training method applied to Nemotron Nano v3. A learnable router trained via Gumbel-Softmax maps any target parameter budget to the optimal nested configuration across all elastic axes — attention heads, Mamba SSM heads, MoE experts, FFN channels, and embedding dimensions. The importance-based ranking that orders these components is computed before training begins, allowing seamless nesting without retraining.
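To make the router idea concrete, here is a minimal sketch (not NVIDIA's code) of how a Gumbel-Softmax router could map a normalized parameter budget to a nested width choice per elastic axis. The axis names, width options, and the BudgetRouter class are illustrative assumptions; because components are importance-ranked ahead of time, "keep the top-k" picks on each axis yield nested submodels.

```python
# Illustrative sketch only; not the Star Elastic implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical nested width options per elastic axis, smallest to largest
# (e.g. keep the top-8, top-12, or all 16 importance-ranked attention heads).
ELASTIC_AXES = {
    "attn_heads":   [8, 12, 16],
    "mamba_heads":  [16, 24, 32],
    "moe_experts":  [4, 6, 8],
    "ffn_channels": [2048, 3072, 4096],
    "embed_dim":    [1536, 2048, 2560],
}

class BudgetRouter(nn.Module):
    """Maps a normalized parameter budget in [0, 1] to per-axis choice logits."""
    def __init__(self, axes):
        super().__init__()
        self.axes = axes
        self.heads = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, len(opts)))
            for name, opts in axes.items()
        })

    def forward(self, budget, tau=1.0, hard=True):
        config = {}
        for name in self.axes:
            logits = self.heads[name](budget.view(1, 1))
            # Straight-through Gumbel-Softmax: a one-hot choice over the nested
            # width options that stays differentiable for router training.
            config[name] = F.gumbel_softmax(logits, tau=tau, hard=hard)
        return config

router = BudgetRouter(ELASTIC_AXES)
cfg = router(torch.tensor([23.0 / 30.0]))  # e.g. target roughly a 23B slice
picked = {k: ELASTIC_AXES[k][v.argmax().item()] for k, v in cfg.items()}
print(picked)
```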

The key insight is matching model size to phase complexity. Star Elastic's budget control assigns the 23B submodel to the thinking phase (high-volume, tolerant of lower capacity) and the 30B model to the final answer (low-volume, precision-critical). This yields +16% accuracy over standard budget control and 1.9× lower latency, measured on AIME-2025, GPQA, LiveCodeBench v5, and MMLU-Pro. The cost reduction is massive: 360× fewer tokens vs. pretraining each variant from scratch, and 7× fewer vs. state-of-the-art sequential compression.
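The phase-dependent budget control can be sketched as a simple routing function. This is a hand-rolled illustration under assumptions, not the released checkpoint's interface: the think/answer tags, token limits, and the two generate callables (standing in for the 23B slice and the full 30B model) are all hypothetical.

```python
# Illustrative sketch only; the actual Star Elastic API may differ.
from typing import Callable

def two_phase_generate(
    think_fn: Callable[[str, int], str],   # e.g. the 23B slice's generate()
    answer_fn: Callable[[str, int], str],  # e.g. the full 30B model's generate()
    prompt: str,
    max_think_tokens: int = 4096,
    max_answer_tokens: int = 256,
) -> str:
    """Route the long thinking phase to the cheaper submodel and the short,
    precision-critical final answer to the full model."""
    thoughts = think_fn(f"{prompt}\n<think>\n", max_think_tokens)
    full_context = f"{prompt}\n<think>\n{thoughts}\n</think>\n"
    return answer_fn(full_context, max_answer_tokens)
```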

Hardware accessibility is a major win. The 12B NVFP4 variant runs on an RTX 5080, where every BF16 configuration runs out of memory. On an RTX Pro 6000 it reaches 7,426 tokens/s, 3.4× the throughput of the 30B BF16 baseline. This makes sophisticated reasoning models viable for local deployment, enabling workflows like generating a 'book of reasoning' on the 12B model at 7,000 t/s, then sliding up to the 30B model to evaluate which results are worth keeping.
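That draft-then-verify workflow reduces to a best-of-n loop across the two slices. The sketch below is an assumption about how one might wire it up, with hypothetical callables for the 12B drafter and a 30B scorer; nothing here comes from the released tooling.

```python
# Illustrative sketch only; drafting and scoring callables are hypothetical.
from typing import Callable, List

def draft_then_verify(
    draft_fn: Callable[[str], str],         # fast 12B slice: generates candidates
    score_fn: Callable[[str, str], float],  # full 30B model: rates prompt/candidate pairs
    prompt: str,
    n_drafts: int = 8,
) -> str:
    """Generate many candidates cheaply, keep the one the full model rates best."""
    drafts: List[str] = [draft_fn(prompt) for _ in range(n_drafts)]
    return max(drafts, key=lambda d: score_fn(prompt, d))
```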

Key Points
  • One checkpoint contains 30B, 23B, and 12B reasoning models extractable via zero-shot slicing.
  • Elastic budget control assigns smaller models to thinking, full model to answers, boosting accuracy by 16% and reducing latency by 1.9×.
  • 12B NVFP4 variant runs on RTX 5080; up to 7,426 tokens/s on RTX Pro 6000 — 3.4× more throughput than 30B BF16.

Why It Matters

Star Elastic makes large reasoning models more accessible and efficient for local deployment and inference.