HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet
New vision model eliminates scanning bottlenecks, achieving state-of-the-art SSM accuracy while cutting inference time in half.
Researchers Badri N. Patro and Vijay S. Agneeswaran have unveiled HAMSA, a novel Vision State Space Model (SSM) that fundamentally rethinks how these models process 2D images. Traditional SSMs like Vim and VMamba rely on complex scanning strategies to adapt sequential processing to spatial data, creating computational overhead and architectural complexity. HAMSA bypasses scanning entirely by operating directly in the spectral (frequency) domain using Fast Fourier Transform (FFT)-based convolution, which processes all tokens in parallel at O(L log L) complexity and simplifies the model architecture.
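To make the complexity claim concrete, here is a minimal PyTorch sketch of the general FFT-convolution idea (not HAMSA's actual code; the function name and shapes are illustrative): a pointwise product in the frequency domain replaces the token-by-token recurrence of a scan.

```python
import torch

def fft_global_conv(x: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Global convolution over a token sequence via FFT, O(L log L) overall.

    x:      (batch, L, dim) flattened image tokens
    kernel: (L, dim) learned convolution kernel
    """
    L = x.shape[1]
    n = 2 * L                                   # zero-pad to avoid circular wrap-around
    x_f = torch.fft.rfft(x, n=n, dim=1)         # O(L log L) transform
    k_f = torch.fft.rfft(kernel, n=n, dim=0)
    y_f = x_f * k_f.unsqueeze(0)                # O(L) pointwise product, no recurrence
    return torch.fft.irfft(y_f, n=n, dim=1)[:, :L]  # inverse transform, crop padding

# Illustrative usage: 196 tokens from a 14x14 patch grid, embedding dim 192
x = torch.randn(8, 196, 192)
k = 0.02 * torch.randn(196, 192)
y = fft_global_conv(x, k)                       # (8, 196, 192)
```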
HAMSA introduces three key technical innovations to make this spectral approach work. First, it simplifies kernel parameterization by replacing traditional (A, B, C) matrices with a single Gaussian-initialized complex kernel, which also eliminates discretization instabilities common in other SSMs. Second, its SpectralPulseNet (SPN) mechanism acts as an input-dependent frequency gate, allowing the model to adaptively modulate which spectral components are most important. Third, the Spectral Adaptive Gating Unit (SAGU) uses magnitude-based gating to ensure stable gradient flow during training in the frequency domain.
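A rough sketch of how these three pieces could fit together in a single block is given below. The module and attribute names (`SpectralBlock`, `freq_gate`, `mag_gate`), the pooling choice, and all shapes are assumptions for illustration, since the summary above does not specify the authors' implementation.

```python
import torch
import torch.nn as nn

class SpectralBlock(nn.Module):
    """Hypothetical sketch of the three mechanisms described above;
    not the paper's actual implementation."""

    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # (1) One Gaussian-initialized complex kernel replaces the (A, B, C)
        #     matrices, so no discretization step is needed.
        self.kernel = nn.Parameter(0.02 * torch.randn(n_freq, dim, dtype=torch.cfloat))
        # (2) SPN-like input-dependent frequency gate: one weight per frequency
        #     bin, predicted from a pooled summary of the input.
        self.freq_gate = nn.Linear(dim, n_freq)
        # (3) SAGU-like magnitude gate: computed from the real-valued magnitude
        #     of the spectrum, keeping the gating path's gradients real.
        self.mag_gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, L, dim)
        x_f = torch.fft.rfft(x, dim=1)                       # (batch, n_freq, dim), complex
        gate = torch.sigmoid(self.freq_gate(x.mean(dim=1)))  # (batch, n_freq)
        x_f = x_f * self.kernel * gate.unsqueeze(-1)         # adaptive spectral mixing
        x_f = x_f * torch.sigmoid(self.mag_gate(x_f.abs()))  # magnitude-based gating
        return torch.fft.irfft(x_f, n=x.shape[1], dim=1)     # back to token space
```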
The performance results are striking. On the ImageNet-1K benchmark, HAMSA reaches 85.7% top-1 accuracy, a new state-of-the-art among Vision SSMs. More impressively, it achieves this with significantly better efficiency: it runs 2.2 times faster than comparable transformers (4.2ms for HAMSA vs 9.2ms for DeiT-S) and 1.4 to 1.9 times faster than other scanning-based SSMs. It also uses substantially less memory (2.1GB vs 3.2-4.5GB) and energy (12.5J vs 18-25J) than competing models, making it a compelling option for deployment. Beyond classification, HAMSA demonstrates strong generalization in transfer learning and dense prediction tasks, suggesting broad applicability.
- Achieves 85.7% top-1 accuracy on ImageNet-1K, setting a new state-of-the-art for Vision SSMs.
- Runs 2.2x faster than transformers (4.2ms vs 9.2ms) and uses 2.1GB memory, offering major efficiency gains.
- Introduces SpectralPulseNet for adaptive frequency gating and a simplified kernel, eliminating scanning and discretization issues.
Why It Matters
This breakthrough in efficiency and accuracy could enable faster, cheaper, and more capable computer vision models for real-world applications like robotics and autonomous systems.