
SM1 variant runs Mamba1 with 16x less memory on Blackwell in pure PyTorch

r/MachineLearning May 23, 2026

⚡d_state=1 closed-form solution eliminates selective scan intermediates entirely.

Deep Dive

A developer known as TechnoVoyager has released SM1 (Scalar Mamba1), a pure PyTorch implementation of the Mamba1 state-space model optimized for NVIDIA's Blackwell architecture. The key innovation is exploiting the d_state=1 boundary, where the recurrence admits a closed-form solution via the variation of parameters method. This replaces the computationally heavy selective scan with two standard PyTorch ops: a cumulative product (torch.cumprod) and a cumulative sum (torch.cumsum). The result is not an approximation: it is algebraically equivalent to the sequential computation and matches it to within floating-point precision, yet it eliminates the entire S (state) dimension from the scan intermediates. Compared to a standard Mamba1 with d_state=16, SM1 uses 16× less memory during the scan, a critical advantage for large models on limited hardware.
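To make the closed form concrete, here is a minimal sketch of a scalar recurrence h_t = a_t * h_{t-1} + b_t computed both step by step and with one cumulative product plus one cumulative sum. It is not SM1's code: the function names and sample values are hypothetical, and it uses itertools.accumulate from the standard library so that it runs without dependencies. On tensors, torch.cumprod and torch.cumsum along the sequence dimension would play the same two roles.

```python
from itertools import accumulate
import operator


def sequential_scan(a, b, h0=0.0):
    """Reference recurrence: h_t = a_t * h_{t-1} + b_t, one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def closed_form_scan(a, b, h0=0.0):
    """Same recurrence via a cumulative product and a cumulative sum."""
    # P_t = a_1 * a_2 * ... * a_t (the cumulative-product step)
    P = list(accumulate(a, operator.mul))
    # S_t = sum over s <= t of b_s / P_s (the cumulative-sum step)
    S = list(accumulate(b_s / P_s for b_s, P_s in zip(b, P)))
    # Variation of parameters: h_t = P_t * (h0 + S_t)
    return [P_t * (h0 + S_t) for P_t, S_t in zip(P, S)]


# Hypothetical sample data: non-zero decay factors and arbitrary inputs.
a = [0.9, 0.5, 0.8, 0.95]
b = [1.0, -2.0, 0.5, 3.0]

seq = sequential_scan(a, b)
cf = closed_form_scan(a, b)
max_abs_diff = max(abs(x - y) for x, y in zip(seq, cf))
print(max_abs_diff)  # expected to be at rounding-error scale
```

One caveat with this naive form: P_t shrinks toward zero over long sequences when the decay factors are below 1, so the division b_s / P_s can overflow. A practical implementation would likely need chunking or log-space arithmetic, which the summary above does not describe.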

The inference benefits are equally striking. For a 130M-parameter model, the inference state consists of just 14,080 floats, about 56 KB at 4 bytes each. There is no KV cache, and per-token computation stays O(1) regardless of context length. This makes the model well suited to long-context applications like music generation, which is exactly what SM1 is being trained on: 163K MIDI files representing ~2.5B tokens in a custom format. Training the 130M-parameter model fits in under 8 GB of VRAM (half the memory of a 16 GB RTX 5060 Ti). The approach is limited to d_state=1 (d_state>1 breaks the closed form), but the tradeoff opens state-space models to consumer-grade GPUs and real-time inference with minimal memory overhead.
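The state-size figure can be checked with a few lines of arithmetic. The sketch below assumes float32 storage (4 bytes per value), which is consistent with the quoted numbers but is not stated explicitly. It also shows, using a hypothetical step function and made-up coefficients, why a fixed-size recurrent state gives constant per-token cost: each token overwrites the state instead of appending to a cache.

```python
N_STATE_FLOATS = 14_080   # state size quoted in the article
BYTES_PER_FLOAT = 4       # assumption: float32 storage

state_bytes = N_STATE_FLOATS * BYTES_PER_FLOAT
state_kb = state_bytes / 1000    # decimal kilobytes
state_kib = state_bytes / 1024   # binary kibibytes


def step(state, a_t, b_t):
    """Hypothetical per-token update: every channel keeps a single scalar."""
    return [a_t * h + b_t for h in state]


# Process a few tokens; the state never grows, unlike a KV cache.
state = [0.0] * N_STATE_FLOATS
for a_t, b_t in [(0.9, 1.0), (0.5, -2.0), (0.8, 0.5)]:
    state = step(state, a_t, b_t)

print(state_bytes, state_kb, state_kib, len(state))
```

Real SM1 inference would use per-channel, input-dependent coefficients rather than one shared pair per token. The point here is only that memory and work per token do not depend on how many tokens came before.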

Key Points

Closed-form recurrence using torch.cumprod and torch.cumsum replaces the selective scan: exact up to floating-point rounding, not an approximation.
Memory reduction of 16× during the scan compared to standard Mamba1 with d_state=16, achieved by eliminating the S dimension.
130M-parameter model has a 56 KB inference state with no KV cache, and training fits in under 8 GB of VRAM on an RTX 5060 Ti.

Why It Matters

Enables lightweight, long-context state-space models on consumer GPUs, opening music generation and other tasks to broader hardware.
