Research & Papers

SM1 variant runs Mamba1 with 16× less memory on Blackwell in pure PyTorch

A closed-form solution at d_state=1 removes the state dimension from the selective scan's intermediates.

Deep Dive

A developer known as TechnoVoyager has released SM1 (Scalar Mamba1), a pure PyTorch implementation of the Mamba1 state-space model optimized for NVIDIA's Blackwell architecture. The key idea is to exploit the d_state=1 boundary case, where the recurrence admits a closed-form solution via the variation of parameters method. This replaces the computationally heavy selective scan with two standard PyTorch ops: a cumulative product and a cumulative sum. The result is not an approximation: it is mathematically identical to the sequential computation, up to floating-point rounding, yet it removes the entire S (state) dimension from the scan intermediates. Compared with a standard Mamba1 at d_state=16, SM1 uses 16× less memory during the scan, a critical advantage for large models on limited hardware.
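
The closed form described above can be illustrated with a small, dependency-free sketch. It is not SM1's code: the names a (per-step decay) and b (per-step input term) are hypothetical, the recurrence is written for a single channel as h_t = a_t * h_(t-1) + b_t with a zero initial state, and itertools.accumulate stands in for torch.cumprod and torch.cumsum. With P_t as the running product of a, the solution is h_t = P_t * sum over i <= t of (b_i / P_i).

```python
from itertools import accumulate
import operator


def sequential_scan(a, b):
    """Reference recurrence: h_t = a_t * h_{t-1} + b_t, starting from h = 0."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def closed_form_scan(a, b):
    """Same recurrence via one cumulative product and one cumulative sum."""
    # P_t = a_1 * a_2 * ... * a_t (the role torch.cumprod would play)
    p = list(accumulate(a, operator.mul))
    # S_t = sum_{i <= t} b_i / P_i (the role torch.cumsum would play)
    s = list(accumulate(b_i / p_i for b_i, p_i in zip(b, p)))
    # h_t = P_t * S_t
    return [p_t * s_t for p_t, s_t in zip(p, s)]


# Hypothetical toy inputs, chosen only for illustration.
a = [0.9, 0.8, 0.95, 0.7]
b = [1.0, 0.5, -0.25, 2.0]

seq = sequential_scan(a, b)
closed = closed_form_scan(a, b)
```

Both functions agree on these inputs to within rounding error. Note that this naive form divides by the cumulative product, which shrinks toward zero over long sequences when the decays are below 1; how SM1 keeps that numerically stable is not described here.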

The inference benefits are equally striking. For a 130M-parameter model, the inference state consists of just 14,080 floats, about 56 KB at 4 bytes each. There is no KV cache, and per-token computation stays O(1) no matter how long the context grows. This makes the model extremely efficient for long-context applications like music generation, which is exactly what SM1 is being trained on: 163K MIDI files representing ~2.5B tokens in a custom format. Training the full 130M-parameter model fits in under 8 GB of VRAM (half the memory of a 16 GB RTX 5060 Ti). The approach is limited to d_state=1, since the closed form breaks for d_state>1, but the tradeoff opens state-space models to consumer-grade GPUs and real-time inference with minimal memory overhead.
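
The state-size figure and the constant per-token cost can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: float32 storage (4 bytes per float) is assumed, and step is a hypothetical single-channel update using the same illustrative a (decay) and b (input) terms, not SM1's actual decode function.

```python
STATE_FLOATS = 14_080   # figure quoted in the text
BYTES_PER_FLOAT = 4     # assumption: float32 storage

state_bytes = STATE_FLOATS * BYTES_PER_FLOAT
state_kb = state_bytes / 1000    # decimal kilobytes
state_kib = state_bytes / 1024   # binary kibibytes


def step(h, a_t, b_t):
    """One decode step for one channel: constant work and constant memory."""
    return a_t * h + b_t


# The state h is overwritten each token; nothing grows with context length.
h = 0.0
for a_t, b_t in [(0.9, 1.0), (0.8, 0.5), (0.95, -0.25)]:
    h = step(h, a_t, b_t)
```

Under the float32 assumption the state comes to 56,320 bytes, which is 56.32 KB or exactly 55 KiB. Because each token only overwrites the fixed-size state, memory does not grow the way a KV cache does.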

Key Points
  • Closed-form recurrence using torch.cumprod and torch.cumsum replaces the selective scan; the result is exact, not approximate.
  • Memory reduction of 16× compared to standard Mamba1 with d_state=16 by eliminating the S dimension.
  • Inference state for the 130M-parameter model is about 56 KB with no KV cache, and training fits in under 8 GB of VRAM on an RTX 5060 Ti.
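
The 16× figure in the second point follows directly from tensor shapes. The sketch below assumes a scan that materializes one intermediate of shape (batch, seq_len, d_inner, d_state); the concrete sizes are hypothetical and chosen only to make the ratio visible.

```python
def scan_intermediate_elems(batch, seq_len, d_inner, d_state):
    """Element count of one scan intermediate of shape (B, L, D, S)."""
    return batch * seq_len * d_inner * d_state


# Hypothetical sizes for illustration; not taken from SM1's configuration.
BATCH, SEQ_LEN, D_INNER = 8, 2048, 1536

baseline = scan_intermediate_elems(BATCH, SEQ_LEN, D_INNER, d_state=16)
scalar = scan_intermediate_elems(BATCH, SEQ_LEN, D_INNER, d_state=1)

ratio = baseline // scalar
```

The ratio equals d_state of the baseline (16 here) regardless of the other dimensions, since only the trailing S axis changes.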

Why It Matters

Enables lightweight, long-context state-space models on consumer GPUs, opening music generation and other tasks to broader hardware.