Open Source

NVIDIA's Nemotron-3 hybrid hits perfect 500K token retrieval on 4×3090

Mamba layers eliminate KV cache growth, achieving 23 t/s decode at half a million tokens.

Deep Dive

NVIDIA's Nemotron-3-Super-120B-A12B (hybrid Mamba2 + periodic attention + MoE) achieves perfect needle retrieval up to 504,482 tokens running locally on 4×3090 GPUs. The model uses Mamba/SSM layers that maintain a constant-size recurrent state instead of a growing KV cache, making long contexts nearly free in terms of VRAM. Quantized to i1-Q4_K_S GGUF (71GB total) with q8_0 KV cache, it fits entirely on 4×3090s (~20GB per card).

Performance numbers are striking: decode speed drops from 72 t/s on short contexts to 23 t/s at 504K tokens – still about 2.7× faster than a comparable full-attention MoE (MiniMax-M2.7-REAP) at just 30K context on the same hardware. Prefill reaches 885 t/s at 504K tokens. Needle-in-haystack tests across 10/50/90% depth show exact recall at every depth tested up to 504k tokens, no misses. However, recency bias remains: hard rules buried early in the context can be overwritten by later instructions – suggesting system prompts or end-of-context rules are more reliable for long-context agents.

Key Points
  • Nemotron-3-Super uses Mamba2 layers with constant-size state, eliminating KV cache growth for near-free long context.
  • Achieves perfect 504K token needle recall on 4×3090 with 23 t/s decode – 2.7× faster than full-attention MoE at 30K.
  • Recency bias persists: instructions buried early can be overwritten by later content; place critical rules at the end.

Why It Matters

Long-context AI inference becomes practical on consumer hardware, enabling real-time document analysis and agent workflows without cloud costs.

📬 Get the top 10 AI stories daily