Open Source

Nemotron-3 Super runs 500K context on 48GB VRAM at 21 tok/s

r/LocalLLaMA May 12, 2026

⚡A quantized 64B math model surprises coders by handling 500K tokens on dual Titan RTX

Deep Dive

A Reddit user, /u/Express_Quail_1493, found a model on Hugging Face, Nemotron-3-Super-64B-A12B-Math-REAP-GGUF, that appears to be tuned for math but works surprisingly well for agentic coding. They have used it on their "potato dual TITAN RTX" setup for all of their coding projects for a week and report being impressed. The user says they "wouldnt dream of having 500k tokens" on that hardware, and invites others to try the model and comment on where it breaks and for which use cases.

Key Points

500K context window fits in just 48GB VRAM using a quantized MoE architecture
Sustains 21 tokens per second inference speed on dual Titan RTX GPUs
Despite being math-tuned, it reportedly performs well on agentic coding tasks in the user's real-world projects
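
To put the headline numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It is not taken from the original post. The 48GB total assumes two Titan RTX cards at 24GB each, treated here as GiB for simplicity. The weight footprint (32 GiB) and runtime overhead (2 GiB) are hypothetical placeholders, since the post does not state the quantization level or the actual memory split. The function names are illustrative only.

```python
GIB = 1024 ** 3  # bytes per GiB


def kv_budget_per_token(total_vram_gib: float, weights_gib: float,
                        overhead_gib: float, context_tokens: int) -> float:
    """Return bytes of VRAM left per context token after weights and overhead."""
    free_bytes = (total_vram_gib - weights_gib - overhead_gib) * GIB
    if free_bytes <= 0:
        raise ValueError("weights and overhead already exceed available VRAM")
    return free_bytes / context_tokens


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Return the time needed to generate num_tokens at a steady rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    total_vram_gib = 2 * 24  # dual Titan RTX, 24 GB each (treated as GiB)

    # ASSUMPTION: 32 GiB of weights and 2 GiB of overhead are placeholders,
    # not figures reported in the post.
    budget = kv_budget_per_token(total_vram_gib, weights_gib=32.0,
                                 overhead_gib=2.0, context_tokens=500_000)
    print(f"Per-token cache budget: {budget / 1024:.1f} KiB")

    # Time to produce a 2,100-token reply at the reported 21 tok/s.
    print(f"Generation time: {generation_seconds(2100, 21.0):.0f} s")
```

Under these assumed numbers, only a few tens of KiB per token remain for cached context state, which suggests why a 500K window on this hardware is notable. Substitute measured values for your own setup before drawing conclusions.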

Why It Matters

Brings large-context AI coding agents within reach of users with consumer-grade GPUs, not just enterprise clusters.
