Open Source

Nemotron-3 Super runs 500K context on 48GB VRAM at 21 tok/s

A quantized 64B math-tuned model handles a 500K-token context on dual Titan RTX GPUs, to one coder's surprise

Deep Dive

A Reddit user, /u/Express_Quail_1493, found a model on Hugging Face called Nemotron-3-Super-64B-A12B-Math-REAP-GGUF that appears to be tuned for math but works surprisingly well for agentic coding. They report having used it on their "potato dual TITAN RTX" setup for all of their coding projects for a week, and they are impressed. The user says they "wouldnt dream of having 500k tokens" of context on that hardware, and invites others to try the model and comment on where it breaks and for which use cases.

Key Points
  • A 500K-token context window fits in 48GB of VRAM (two 24GB Titan RTX cards) using a quantized GGUF build of an MoE model
  • Reported inference speed is 21 tokens per second on the dual Titan RTX setup
  • Although apparently tuned for math, it reportedly works well for agentic coding across one user's real projects over a week
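
To put those figures in perspective, here is a rough back-of-envelope sketch. Only the three headline numbers (48GB, 500K tokens, 21 tok/s) come from the post. The function names, the 1,000-token reply length, the decimal-gigabyte convention, and the reading of 21 tok/s as generation speed are assumptions made for illustration.

```python
# Back-of-envelope arithmetic for the reported setup.
# From the post: 48 GB total VRAM, 500K-token context, 21 tokens/second.
# Assumptions (not from the post): 1 GB = 10**9 bytes, 21 tok/s is generation
# (decode) speed, and the helper names below are purely illustrative.

TOTAL_VRAM_GB = 48          # dual Titan RTX, 24 GB per card
NUM_GPUS = 2
CONTEXT_TOKENS = 500_000
TOKENS_PER_SECOND = 21


def vram_per_gpu_gb(total_gb: float, num_gpus: int) -> float:
    """Even split of total VRAM across the cards."""
    return total_gb / num_gpus


def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Time to generate num_tokens at a steady tok_per_s rate."""
    return num_tokens / tok_per_s


def max_bytes_per_context_token(total_gb: float, context_tokens: int) -> float:
    """Loose upper bound on memory available per token of context.

    It pretends all VRAM is free for the context cache. In practice the
    model weights occupy a large share, so the real budget is smaller.
    """
    return total_gb * 10**9 / context_tokens


if __name__ == "__main__":
    print(f"VRAM per GPU: {vram_per_gpu_gb(TOTAL_VRAM_GB, NUM_GPUS):.0f} GB")
    print(f"1,000-token reply: {generation_seconds(1000, TOKENS_PER_SECOND):.1f} s")
    bound = max_bytes_per_context_token(TOTAL_VRAM_GB, CONTEXT_TOKENS)
    print(f"Upper bound per context token: {bound / 1000:.0f} KB")
```

The last figure is the interesting one: even if the weights took no space at all, each token of context could use at most about 96 KB of VRAM, so the model's per-token cache footprint has to be well below that for the reported configuration to fit.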

Why It Matters

If the report holds up, it brings large-context AI coding agents within reach of users with consumer-grade GPUs, not just enterprise clusters.