SubQ just blew my mind - 12M token context with sub-quadratic attention
Processes 1M tokens 52x faster than FlashAttention, at under 5% of Claude Opus's cost.
SubQ, developed by an independent team, introduces the first large language model built on a fully sub-quadratic sparse-attention (SSA) architecture. This breakthrough enables a 12 million token context window, dwarfing the typical 128K-200K limits of models like GPT-4 or Claude. In benchmarks, SubQ processes 1 million tokens 52x faster than FlashAttention, the standard efficient attention implementation, while costing under 5% as much as Claude Opus at equivalent context lengths.
The key innovation is that SSA allocates compute only to the most relevant token relationships rather than attending to every pair; if each token attends to a roughly fixed number of others, total cost grows linearly with context length instead of quadratically. That eliminates the explosion that normally makes long-context inference prohibitively slow and expensive. For developers, this means agentic coding tools can hold entire codebases in context without chunking, and researchers can feed full-length books or papers into a single prompt. SubQ’s efficiency could democratize access to ultra-long-context AI, shifting the economics of large-scale document analysis and multi-step reasoning tasks.
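SubQ's exact SSA mechanism hasn't been published in detail, so here is only a rough illustration of the general idea: a minimal top-k sparse attention sketch in NumPy, where each query attends to just its k highest-scoring keys (the function name, shapes, and k are invented for illustration, not taken from SubQ). Note the caveat in the docstring: this toy still scores every pair to find the top k, so it demonstrates only the sparse aggregation step; genuinely sub-quadratic schemes select candidate keys without ever forming the full n x n score matrix, for example via hashing, clustering, or learned routing.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=8):
    """Toy sparse attention: each query attends to only its k
    highest-scoring keys instead of all n keys.

    Illustrative only -- NOT SubQ's actual SSA. The argpartition
    step below still scores all n^2 pairs; real sub-quadratic
    schemes pick candidate keys without doing that.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) toy score matrix
    idx = np.argpartition(scores, -k, axis=-1)[:, -k:]  # (n, k) top-k key ids per query
    top = np.take_along_axis(scores, idx, axis=-1)      # (n, k) their scores
    w = np.exp(top - top.max(axis=-1, keepdims=True))   # softmax over k keys only
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('nk,nkd->nd', w, V[idx])           # aggregate k value rows per query

# Quick check on random data.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
out = topk_sparse_attention(Q, K, V, k=8)
print(out.shape)  # (128, 64)
```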
- 12 million token context window, built on a fully sub-quadratic sparse-attention (SSA) architecture.
- Processes 1M tokens 52x faster than FlashAttention at under 5% of the cost of Claude Opus.
- Linear scaling enables practical long-context work like agentic coding over full codebases and whole-document analysis with no chunking (see the back-of-envelope comparison after this list).
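As a back-of-envelope check on that scaling claim (the per-query key budget k = 64 below is an invented illustrative number, not a SubQ parameter): dense attention touches n^2 query-key pairs, while a scheme that attends to a fixed k keys per query touches only n*k.

```python
# Back-of-envelope: pairwise interactions touched by dense vs. sparse attention.
# k = 64 is a made-up per-query key budget for illustration, not a SubQ figure.
k = 64
for n in (128_000, 1_000_000, 12_000_000):
    dense, sparse = n * n, n * k
    print(f"n={n:>10,}: dense {dense:.1e} pairs, sparse {sparse:.1e} pairs "
          f"({dense // sparse:,}x fewer)")
```

The raw pair-count gap here is far larger than the reported 52x speedup over FlashAttention, which is what you'd expect: FlashAttention is already highly optimized, and sparse schemes pay selection and bookkeeping overhead on top of the n*k core work.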
Why It Matters
SubQ eliminates context-window bottlenecks, making ultra-long-context AI affordable and fast for professionals and developers.