Viral Wire

MiniMax M3: 1M-token context, native multimodal AI

New frontier model handles images, video, and code with sparse attention.

Deep Dive

MiniMax has unveiled M3, a frontier-level multimodal model built around long context and native multimodality. M3 has a 1-million-token context window, large enough to process entire books or lengthy codebases in a single pass. Unlike many models that rely on separate plugins for vision, M3 accepts image and video input natively and can analyze visual content directly. It is geared toward specialized tasks such as coding, agentic workflows (AI that can autonomously take actions), and complex reasoning.
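To make the "entire books or codebases" claim concrete, here is a rough back-of-the-envelope sketch of checking whether a set of documents would fit in a 1-million-token window. The 4-characters-per-token ratio is an assumed rule of thumb for English text, not a property of M3's tokenizer, and the function names and the output reserve are hypothetical.

```python
import math

CONTEXT_TOKENS = 1_000_000  # window size reported for M3


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate. chars_per_token is an assumed heuristic,
    not a measured value for any particular tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(documents, context_tokens=CONTEXT_TOKENS,
                    reserved_for_output=8_000):
    """Return (fits, total_estimated_tokens) for a list of strings.

    reserved_for_output is a hypothetical allowance for the model's reply.
    """
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_for_output <= context_tokens, total
```

Under this heuristic, a book of 1.5 million characters comes to about 375,000 estimated tokens, well inside the window; actual counts depend on the tokenizer and the language of the text.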

A key innovation is the MiniMax Sparse Attention (MSA) architecture, which selectively focuses on the most relevant parts of the input. This reduces computational overhead and keeps the model efficient despite its large context, enabling faster inference and lower costs and making M3 practical for enterprise applications. MiniMax positions M3 as a competitor to models like GPT-4o and Gemini, with a distinct focus on ultra-long context and native multimodality at reduced resource consumption.
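The general idea behind sparse attention can be illustrated with a generic top-k variant. This is a minimal sketch and not MiniMax's MSA, whose details are not described here: for a single query, it keeps only the k highest-scoring keys and runs softmax over those. The function names are hypothetical, and for simplicity the sketch still scores every key; production designs typically use a cheaper selection step.

```python
import math


def _softmax_weighted_sum(scores, values):
    """Softmax over scores, then the weighted sum of the value vectors."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[d] for e, v in zip(exps, values))
            for d in range(dim)]


def _scores(query, keys):
    """Scaled dot-product score of one query against every key."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]


def dense_attention(query, keys, values):
    """Standard attention: the query attends to all keys."""
    return _softmax_weighted_sum(_scores(query, keys), values)


def topk_sparse_attention(query, keys, values, k):
    """Generic top-k sparse attention (illustrative, not MSA):
    attend only to the k keys with the highest scores."""
    scores = _scores(query, keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return _softmax_weighted_sum([scores[i] for i in top],
                                 [values[i] for i in top])
```

With k equal to the number of keys, the sparse version matches dense attention; with a small k, the softmax and value aggregation touch only a handful of positions, which is where the savings come from as the context grows.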

Key Points
  • 1-million-token context window for processing entire books or codebases
  • Native support for image and video input without external plugins
  • MiniMax Sparse Attention (MSA) architecture improves efficiency and speed

Why It Matters

Long-context multimodal AI enables richer analysis of documents, images, and video in one model.