InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model
New attention-free architecture combines Mamba's efficiency with Transformer-level accuracy across vision and language tasks.
A research team led by Youjin Wang has introduced InfoMamba, a hybrid architecture aimed at a central trade-off in sequence modeling: Transformers offer strong global interactions but scale quadratically with sequence length, while Mamba-style selective state-space models (SSMs) scale linearly but can miss global dependencies. InfoMamba keeps the linear-scaling SSM backbone and, instead of quadratic-complexity self-attention, adds a concept bottleneck linear filtering layer that acts as a minimal-bandwidth global interface. This global stream is then fused with the selective recurrent stream through an information-maximizing fusion (IMF) mechanism.
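As a rough illustration of what such a global interface could look like, here is a minimal PyTorch-style sketch: tokens are projected onto a small number of concept channels, pooled over the sequence, and the pooled summary is broadcast back to every position. The module name, the choice of mean pooling, and the `n_concepts` parameter are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ConceptBottleneckFilter(nn.Module):
    """Linear, attention-free global interface: project tokens onto a small set of
    concept channels, pool them over the sequence, and broadcast the pooled summary
    back to every position, keeping the global pathway linear in sequence length."""
    def __init__(self, d_model: int, n_concepts: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, n_concepts)  # token features -> concept channels
        self.up = nn.Linear(n_concepts, d_model)    # pooled concepts -> feature space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D). Mean pooling is O(L); a causal variant could use a
        # cumulative mean instead of a full-sequence mean.
        concepts = self.down(x).mean(dim=1, keepdim=True)   # (B, 1, K) global summary
        return self.up(concepts).expand(-1, x.size(1), -1)  # (B, L, D) broadcast context
```

In the described design, this broadcast context is not added to token representations directly; it is combined with the selective SSM stream through the IMF mechanism discussed next.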
The key innovation lies in the IMF approach, which dynamically injects global context into SSM dynamics while encouraging complementary information usage through a mutual-information-inspired objective. This allows the model to capture both fine-grained local patterns and long-range dependencies that pure SSMs often miss. Extensive experiments across diverse domains—including classification, dense prediction, and non-vision tasks—demonstrate that InfoMamba consistently outperforms strong Transformer and SSM baselines while maintaining computational efficiency.
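A hedged sketch of how such a fusion step might look is below, assuming a per-token gate that decides how much global context to inject into the recurrent stream, and approximating the mutual-information-inspired term with a simple redundancy penalty between the two streams. The class name, gating form, and cosine-similarity surrogate are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IMFusion(nn.Module):
    """Gated fusion of the selective-SSM stream with the broadcast global context."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, local: torch.Tensor, global_ctx: torch.Tensor):
        # local:      (B, L, D) output of the selective recurrent (SSM) stream
        # global_ctx: (B, L, D) output of the concept-bottleneck global interface
        g = torch.sigmoid(self.gate(torch.cat([local, global_ctx], dim=-1)))
        fused = self.out(g * local + (1.0 - g) * global_ctx)  # per-token injection of global context

        # Simple surrogate for the mutual-information-inspired term: penalize
        # redundancy so the two streams carry complementary information.
        redundancy = F.cosine_similarity(local, global_ctx, dim=-1).mean()
        return fused, redundancy

# Training would then combine objectives along the lines of:
#   total_loss = task_loss + lambda_mi * redundancy
```

The design intent, as described, is that the gate adapts per token, so global information is injected only where the recurrent stream lacks it, while the auxiliary term discourages the two streams from duplicating each other.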
The researchers' consistency boundary analysis provides theoretical grounding, characterizing when diagonal short-memory SSMs can approximate causal attention and identifying structural gaps that remain. This analysis directly motivated the hybrid design, which bridges these gaps through the global interface and fusion mechanism. The result is a practical architecture that offers competitive accuracy-efficiency trade-offs, potentially enabling longer-context processing and more complex reasoning tasks without the prohibitive computational costs of pure Transformer models.
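To make the structural gap concrete (a standard comparison, not necessarily the paper's exact formulation): a diagonal linear SSM mixes past tokens with weights that depend only on the lag and decay geometrically, whereas causal attention mixes them with content-dependent weights.

$$
h_t = A h_{t-1} + B x_t,\quad y_t = C h_t \;\Rightarrow\; y_t = \sum_{s \le t} C A^{t-s} B\, x_s,
\qquad\text{vs.}\qquad
y_t^{\text{attn}} = \sum_{s \le t} \operatorname{softmax}_s\!\big(q_t^{\top} k_s / \sqrt{d}\big)\, v_s .
$$

Under this view, the concept bottleneck supplies a global summary that the short-memory, lag-only SSM kernel cannot recover on its own, and IMF decides how much of it to inject at each position.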
- Replaces quadratic self-attention with linear concept bottleneck filtering for global context
- Uses information-maximizing fusion to dynamically inject global information into SSM dynamics
- Outperforms Transformer and SSM baselines while maintaining near-linear scaling across multiple task types
Why It Matters
Enables longer-context AI models with Transformer-level accuracy but dramatically lower computational costs for real-world deployment.