Research & Papers

Going from 3B/7B dense to Nemotron 3 Nano (hybrid Mamba-MoE) for multi-task reasoning — what changes in the fine-tuning playbook? [D]

30B model with 3.6B active parameters tests LoRA limits

Deep Dive

An independent researcher is working out fine-tuning strategies for NVIDIA's Nemotron 3 Nano, a 30B-parameter hybrid Mamba-Attention-MoE model with only 3.6B parameters active per token. The architecture blends 23 Mamba-2 layers (selective state-space models), 23 sparse MoE layers (128 experts each, top-6 routing), and 6 GQA attention layers. The target is multi-task reasoning: distinguishing structural situations from surface statements, holding multiple perspectives, surfacing load-bearing threads, and conditioning on numeric context features.
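
A quick back-of-the-envelope calculation shows where the 30B-total vs 3.6B-active gap comes from. Only the layer mix and the top-6-of-128 routing are from the post; the per-expert and shared parameter counts below are illustrative assumptions chosen to roughly reproduce the quoted figures.

```python
# Why sparse MoE routing shrinks the active parameter count.
# Layer counts and routing come from the post; the two sizes below are
# assumed, picked only so the totals land near the quoted 30B / 3.6B.

N_MOE_LAYERS = 23
N_EXPERTS = 128
TOP_K = 6
EXPERT_PARAMS = 9_400_000      # assumed parameters per expert FFN
SHARED_PARAMS = 2_300_000_000  # assumed Mamba-2 + attention + embedding params

total = SHARED_PARAMS + N_MOE_LAYERS * N_EXPERTS * EXPERT_PARAMS
active = SHARED_PARAMS + N_MOE_LAYERS * TOP_K * EXPERT_PARAMS

print(f"total params:  {total / 1e9:.1f}B")   # every expert counts toward VRAM
print(f"active params: {active / 1e9:.1f}B")  # only the top-6 experts fire per token
```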

The project uses 40-80k synthetic examples generated by Sonnet 4.6, with Opus 4.7 reserved for the hardest 20%, following Orca-style explanation tuning. Running on a single H100 80GB via RunPod (~$120 budget), the researcher faces largely undocumented questions: Can LoRA safely be applied to MoE router weights? Does Mamba-2's selective SSM state tolerate low-rank perturbations? How does the auxiliary load-balancing loss (sketched after the key points below) interact with an imbalanced multi-task dataset? Does sparse routing protect against catastrophic forgetting better than dense models do?
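A minimal sketch of one conservative answer to the router question, assuming the Hugging Face peft library, a placeholder Hub repo id, and hypothetical module names (Mamba-2 in/out projections plus attention projections): adapt the dense projections with LoRA and leave the MoE router weights frozen.

```python
# Sketch: LoRA on dense projections only, routers untouched.
# Module names and the checkpoint id are assumptions, not Nemotron's actual naming.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/Nemotron-3-Nano",  # placeholder id; swap in the real checkpoint
    trust_remote_code=True,
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Router/gate weights are deliberately absent here, so they stay frozen.
    target_modules=["in_proj", "out_proj", "q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity-check what is actually trainable
```

Whether the routers should instead be lightly trained (for example at a reduced learning rate) is exactly the open question; this config only makes the safe baseline explicit.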

Key Points
  • Nemotron 3 Nano has 30B total parameters but only 3.6B active per token via sparse MoE routing
  • Architecture includes 23 Mamba-2 layers, 23 sparse MoE layers (128 experts each), and 6 GQA attention layers
  • Training data is 40-80k synthetic examples from Sonnet 4.6/Opus 4.7; fine-tuning runs on an H100 80GB via RunPod
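
To make the load-balancing question from the deep dive concrete, here is a generic sketch of the Switch-Transformer-style auxiliary loss under top-6 routing. The 128-expert / top-6 numbers come from the post; the formulation itself is the standard one, not Nemotron's actual implementation.

```python
# Generic auxiliary load-balancing loss: N * sum_i f_i * P_i, where f_i is the
# fraction of tokens routed to expert i and P_i is the mean router probability
# for expert i. Imbalanced task mixes can pull f_i away from P_i, which is what
# this term pushes back against.
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 6) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) pre-softmax gate scores."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                    # (tokens, experts)
    top_idx = probs.topk(top_k, dim=-1).indices                 # chosen experts per token
    mask = F.one_hot(top_idx, num_experts).sum(dim=1).float()   # (tokens, experts), 0/1
    tokens_per_expert = mask.mean(dim=0)                        # f_i
    prob_per_expert = probs.mean(dim=0)                         # P_i
    return num_experts * (tokens_per_expert * prob_per_expert).sum()

# Near-uniform routing drives the loss toward its minimum of top_k (here ~6).
logits = torch.randn(1024, 128)
print(load_balancing_loss(logits).item())
```

With a skewed multi-task mix, per-task routing statistics can differ sharply from the batch-level averages this loss sees, which is why its interaction with imbalanced data is flagged as an open question.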

Why It Matters

This exploration could yield a practical playbook for fine-tuning hybrid Mamba-MoE architectures, lowering the compute cost of adapting them to multi-task reasoning.