MiniMax's MSA architecture delivers 1M context at 4x speed
4x faster than Flash-Sparse-Attention, with per-token compute at 1/20th of MiniMax's previous generation at full context.
MiniMax has unveiled MiniMax Sparse Attention (MSA), an attention architecture that natively scales context windows to 1M tokens without the quadratic cost of standard transformer attention. Rather than relying on sparse approximations that degrade recall, MSA restructures memory access at the operator level using a “KV outer gather Q” approach: KV blocks form the outer loop, and each block gathers the queries that hit it, so hardware memory reads stay strictly contiguous and each block is fetched exactly once. MiniMax reports 4x faster execution than Flash-Sparse-Attention, per-token compute at 1/20th of its previous-generation models at the full 1M context, a 9x speedup in prefilling, and a 15x speedup in decoding.
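To make the loop order concrete, here is a toy pure-Python sketch of the "KV outer" idea next to a conventional query-outer baseline. It is not MiniMax's kernel. The block-selection sets, the function names, and the use of an online (running) softmax to merge per-block partial results are all assumptions made for illustration, and causal masking is ignored. It only shows that putting KV blocks in the outer loop lets each block be read once while producing the same attention output.

```python
import math
import random
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_outer(Q, K, V, selected, block):
    """Baseline: every query walks its own list of selected KV blocks."""
    out, loads = [], 0
    for q, blocks in zip(Q, selected):
        scores, vals = [], []
        for b in sorted(blocks):
            loads += 1  # the block is re-read for each query that selects it
            for j in range(b * block, (b + 1) * block):
                scores.append(dot(q, K[j]))
                vals.append(V[j])
        top = max(scores)
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, vals)) / z
                    for c in range(len(V[0]))])
    return out, loads

def kv_outer(Q, K, V, selected, block):
    """KV blocks are the outer loop; each block serves all queries that hit it."""
    hits = defaultdict(list)  # block index -> queries that selected it
    for qi, blocks in enumerate(selected):
        for b in blocks:
            hits[b].append(qi)
    d_v = len(V[0])
    run_max = [-math.inf] * len(Q)        # running softmax maximum per query
    denom = [0.0] * len(Q)                # running softmax denominator per query
    acc = [[0.0] * d_v for _ in Q]        # running weighted sum of values
    loads = 0
    for b in sorted(hits):
        loads += 1  # one contiguous read of this KV block
        k_blk = K[b * block:(b + 1) * block]
        v_blk = V[b * block:(b + 1) * block]
        for qi in hits[b]:
            scores = [dot(Q[qi], k) for k in k_blk]
            new_max = max(run_max[qi], max(scores))
            scale = math.exp(run_max[qi] - new_max)  # 0.0 on the first block
            w = [math.exp(s - new_max) for s in scores]
            denom[qi] = denom[qi] * scale + sum(w)
            acc[qi] = [a * scale + sum(wi * v[c] for wi, v in zip(w, v_blk))
                       for c, a in enumerate(acc[qi])]
            run_max[qi] = new_max
    # Assumes every query selected at least one block, so denom is nonzero.
    out = [[a / denom[qi] for a in acc[qi]] for qi in range(len(Q))]
    return out, loads

# Hypothetical toy data: 6 queries, 16 KV tokens in 4 blocks of 4.
rng = random.Random(0)
dim, block, n_kv = 4, 4, 16
Q = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
K = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_kv)]
V = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_kv)]
selected = [{0, 1}, {0, 2}, {1, 3}, {0, 3}, {2, 3}, {1, 2}]  # made-up selections

out_q, loads_q = query_outer(Q, K, V, selected, block)
out_kv, loads_kv = kv_outer(Q, K, V, selected, block)
max_diff = max(abs(a - b) for ra, rb in zip(out_q, out_kv) for a, b in zip(ra, rb))
print(loads_q, loads_kv, max_diff < 1e-9)
```

With these made-up selections, the query-outer baseline performs 12 block reads (6 queries times 2 blocks each), while the KV-outer version performs 4, one per block, and the two outputs agree to floating-point tolerance. A real kernel would additionally have to handle block selection, masking, and on-chip tiling, none of which the source describes in detail.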
The architecture is optimized for hardware-level data transport and memory layouts, which MiniMax says suits it to sustained, long-horizon agent execution. MiniMax also claims that the model built on MSA is the first open-weight model to combine frontier-level coding ability, a native 1M-token context length, and native multimodality. By easing the usual trade-off between context length and computational efficiency, MSA is meant to let teams run large-scale retrieval, long-document analysis, and multi-turn agent workflows without excessive cost or latency. If the reported gains hold up, this could accelerate enterprise adoption of long-context AI systems.
- MSA achieves 4x faster execution than Flash-Sparse-Attention and 15x faster decoding at full 1M context.
- Per-token compute drops to 1/20th of MiniMax's previous-gen models at 1M context depth.
- First open-weight model to combine frontier coding, 1M native context, and native multimodality.
Why It Matters
By cutting compute cost and latency at 1M-token context, MSA could make long-context AI agents practical at scale for enterprise workflows.