Why SSMs struggle in parameter-constrained training: empirical findings at 25M parameters [R]
SSM in_proj weights compress up to 3.26x worse than attention QKV under LZMA at 25M params
In a deep dive after three weeks of OpenAI's Parameter Golf competition, researcher mradassaad presents empirical evidence that State Space Models (SSMs) are structurally disadvantaged relative to Transformers under tight time and size constraints. The competition limits entries to 25 million parameters, 10 minutes of training on 8xH100 GPUs, and a 16MB compressed artifact. The central finding: SSM in_proj weights compress up to 3.26x worse than Transformer attention QKV weights under the LZMA compression algorithm, so they consume a disproportionate share of the compressed-artifact budget and hand Transformers a clear advantage in this regime.
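The compressibility gap can be probed directly: serialize a weight matrix to bytes and compare its raw size to its LZMA-compressed size. A minimal sketch of that measurement, assuming PyTorch weights cast to fp16 before serialization; the tensor shapes and the casting choice are illustrative placeholders, not the author's setup:

```python
# Hypothetical sketch (not the post's script): estimate how well weight
# matrices compress under LZMA, the codec used for the 16MB artifact budget.
import lzma
import torch

def lzma_ratio(weight: torch.Tensor) -> float:
    """Raw byte size divided by LZMA-compressed size (higher = more compressible)."""
    raw = weight.detach().to(torch.float16).cpu().numpy().tobytes()
    packed = lzma.compress(raw, preset=9)
    return len(raw) / len(packed)

# Placeholder tensors standing in for an SSM in_proj and an attention QKV weight;
# in practice these would come from trained checkpoints.
ssm_in_proj = torch.randn(2 * 1024, 512)
attn_qkv = torch.randn(3 * 512, 512)

print(f"in_proj LZMA ratio: {lzma_ratio(ssm_in_proj):.2f}")
print(f"qkv     LZMA ratio: {lzma_ratio(attn_qkv):.2f}")
```

Random tensors like these compress roughly equally, of course; the post's 3.26x gap only appears on trained weights, where attention QKV matrices evidently carry more LZMA-exploitable structure than SSM input projections.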
Further analysis shows that architectural decisions which looked favorable at a smaller vocabulary size (SP4096) reversed sign at the target vocabulary (SP8192), a reminder that gains measured in one configuration do not necessarily transfer. The post also details three kernel-level experiments with Mamba-3 Triton kernels: a backward-pass fusion that was numerically exact yet 16% slower due to shared-memory pressure, a torch.compile quantizer bug that introduced a 5.5 mBPB penalty, and a mixed-precision scheme protecting the state-dynamics parameters that recovered 0.8 mBPB at negligible size cost. Together these results underscore the practical engineering challenges of adapting SSMs to constrained settings.
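The 0.8 mBPB mixed-precision recovery suggests that a small set of parameters governing the SSM state dynamics is disproportionately sensitive to reduced precision. A rough sketch of that idea, keeping those parameters in fp32 while casting the bulk of the weights to fp16; the protected name patterns (A_log, dt_bias, dt_proj) are assumptions borrowed from common Mamba implementations, not the exact recipe from the post:

```python
# Hedged sketch: shield SSM dynamics parameters from precision loss while
# down-casting everything else to keep the serialized artifact small.
import torch

# Assumed name fragments for dynamics-related parameters (illustrative only).
PROTECTED_SUBSTRINGS = ("A_log", "dt_bias", "dt_proj")

def split_precision_state_dict(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    """Return a state dict with dynamics params in fp32 and the rest in fp16."""
    out = {}
    for name, param in model.state_dict().items():
        if any(key in name for key in PROTECTED_SUBSTRINGS):
            out[name] = param.to(torch.float32)  # dynamics: keep full precision
        else:
            out[name] = param.to(torch.float16)  # bulk weights: half precision
    return out
```

Because the dynamics parameters are tiny relative to the projection matrices, protecting them adds almost nothing to the artifact size, consistent with the "negligible size cost" reported.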
- SSM in_proj weights compress 3.26x worse than Transformer QKV under LZMA, consuming more of the 16MB artifact budget
- Architectural wins at vocabulary size 4096 flipped sign at the target size 8192, indicating non-transferable improvements
- Three kernel experiments revealed a 16% slowdown from backward fusion, a 5.5 mBPB loss from a quantization bug, and a 0.8 mBPB recovery from mixed-precision fixes (the mBPB metric is sketched below)
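For reference, mBPB here is presumably milli-bits-per-byte, the compression-style language-modeling metric: total cross-entropy over the evaluation text, converted to bits, divided by the text's byte length, times 1000. A minimal sketch of the conversion (the loss and byte counts below are placeholders, not numbers from the post):

```python
# Convert a summed cross-entropy (in nats, as PyTorch reports it) into mBPB.
import math

def mbpb(total_nats: float, total_bytes: int) -> float:
    """Milli-bits-per-byte: (total loss in bits / bytes of evaluated text) * 1000."""
    total_bits = total_nats / math.log(2)
    return 1000.0 * total_bits / total_bytes

# Placeholder example: roughly 865 mBPB on a 100MB evaluation set.
print(mbpb(total_nats=6.0e7, total_bytes=100_000_000))
```

At this scale a 5.5 mBPB regression corresponds to roughly 5.5e-3 * 1e8 / 8 ≈ 69KB of extra coding cost over a 100MB test set, which is why the quantizer bug was worth tracking down.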
Why It Matters
These findings guide model architects choosing between SSMs and Transformers under strict compute and storage constraints.