[D] A mathematical proof from an anonymous Korean forum: The essence of Attention is fundamentally a d^2 problem, not n^2. (PDF included)
An anonymous Korean forum paper claims to mathematically prove that Attention's true dimension is d², not n², potentially enabling 2x faster training.
An anonymous mathematical proof originating from a Korean AI community forum, 'The Singularity Gallery,' is challenging a core assumption in large language model architecture. The paper, titled 'The d² Pullback Theorem: Why Attention is a d²-Dimensional Problem,' argues that the field has fundamentally misunderstood the intrinsic geometry of the Transformer's Attention mechanism. The author, who claims not to work in the LLM industry, presents a theorem asserting that although both the forward pass and the backward gradient operate on n × n matrices, the optimization landscape actually explored by the model parameters is strictly d²-dimensional, where d is the embedding dimension. On this view, the notorious O(n²) computational bottleneck, in which cost scales with the square of the sequence length, is an illusion created by the choice of softmax normalization, which artificially inflates the rank to n.
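(The post does not reproduce the proof itself. One standard identity consistent with the 'pullback' framing, sketched below with notation that is mine rather than the paper's, is that for a single attention head the n × n score matrix factors entirely through a single d × d matrix, so the trainable score parameters, and the gradients flowing back to them, occupy at most a d²-dimensional space.)

```latex
% For one head with queries Q = X W_Q and keys K = X W_K (X is n x d,
% W_Q and W_K are d x d), the n x n score matrix depends on the weights
% only through the d x d product M = W_Q W_K^T, so the loss gradient
% with respect to the score parameters lives in R^{d x d} (d^2 numbers).
S \;=\; \frac{QK^{\top}}{\sqrt{d}}
  \;=\; \frac{(XW_Q)(XW_K)^{\top}}{\sqrt{d}}
  \;=\; \frac{X\,M\,X^{\top}}{\sqrt{d}},
\qquad M := W_Q W_K^{\top} \in \mathbb{R}^{d\times d},
\qquad \frac{\partial \mathcal{L}}{\partial M} \in \mathbb{R}^{d\times d}.
```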
The proof contends that softmax, while providing the necessary 'matching' contrast, destroys the underlying Euclidean structure. The author proposes replacing it with a degree-2 polynomial kernel (x²) in a new architecture called CSQ (Centered Shifted-Quadratic) Attention. CSQ is said to retain the matching property while operating within the d²-dimensional optimization space identified by the theorem, theoretically reducing both training and inference complexity to O(nd³), i.e. linear rather than quadratic in sequence length. If validated, this could provide the theoretical foundation for next-generation architectures significantly faster and more scalable than the Transformers behind current models like GPT-4o and Claude 3.5. The paper is now circulating for expert verification, with the potential to redefine efficiency benchmarks in AI.
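(The post does not spell out the CSQ formulation or its centering and shifting terms, so the sketch below shows only the general mechanism it appeals to: a plain degree-2 polynomial kernel has an explicit d²-dimensional feature map, which lets attention be computed without ever forming the n × n score matrix. All function names are hypothetical, and the kernel here is uncentered and unshifted.)

```python
import numpy as np

def quad_features(x):
    """Degree-2 polynomial feature map phi(x) = vec(x x^T), dimension d^2.
    Satisfies phi(q) . phi(k) = (q . k)^2."""
    n, d = x.shape
    return np.einsum('ni,nj->nij', x, x).reshape(n, d * d)

def quadratic_kernel_attention(Q, K, V, eps=1e-6):
    """Attention with a plain (q . k)^2 kernel, computed in O(n * d^3).

    The n x n score matrix is never materialized: associativity lets us
    contract phi(K) with V first (a d^2 x d matrix), then apply phi(Q).
    """
    phi_Q = quad_features(Q)          # (n, d^2)
    phi_K = quad_features(K)          # (n, d^2)
    KV = phi_K.T @ V                  # (d^2, d)   -- O(n d^3)
    Z = phi_K.sum(axis=0)             # (d^2,)     row-sum normalizer
    numer = phi_Q @ KV                # (n, d)     -- O(n d^3)
    denom = phi_Q @ Z + eps           # (n,)
    return numer / denom[:, None]
```

Nothing here is specific to CSQ; the point is only that once softmax is replaced by a fixed polynomial kernel, the cost becomes linear in n with a d³ factor, which is where the O(nd³) figure in the paper's claim comes from.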
- Mathematical 'd² Pullback Theorem' proves Attention's optimization landscape is d²-dimensional, not n².
- Identifies softmax as creating an artificial O(n²) bottleneck by inflating rank.
- Proposes CSQ Attention with a degree-2 polynomial kernel to achieve O(nd³) complexity while preserving the matching property.
Why It Matters
Could enable faster, cheaper LLMs by reducing the Transformer's attention cost from O(n²·d) to O(n·d³), i.e. from quadratic to linear in sequence length.