HyperMLP: An Integrated Perspective for Sequence Modeling
A new paper claims a simpler MLP-based approach can beat traditional transformer attention.
Researchers propose HyperMLP and HyperGLU, two architectures built on a perspective that views an autoregressive attention head as a dynamic two-layer MLP whose hidden layer grows with the sequence. In this formulation, the attention scores become the hidden representation, and standard MLP activations such as ReLU replace softmax for input-conditioned selection. The paper provides theoretical characterizations and shows that, under matched parameter budgets, HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines, challenging a core tenet of modern transformer architecture design.
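To make the reframing concrete, here is a minimal sketch (not taken from the paper) of a single autoregressive head computed two ways: standard softmax attention, and an MLP-style variant in which the cached keys and values act as the two weight matrices of an MLP whose hidden width equals the current sequence length, with ReLU in place of softmax. The shapes, scaling, and choice of activation are illustrative assumptions; the paper's exact parameterization may differ.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(q_t, K, V):
    # Standard autoregressive attention for one query q_t (shape [d])
    # against the cached keys K and values V (each shape [t, d]).
    scores = K @ q_t / np.sqrt(K.shape[-1])   # [t] attention scores
    return softmax(scores) @ V                # [d] output

def hypermlp_style_step(q_t, K, V):
    # The same computation read as a two-layer MLP whose hidden width is
    # the current sequence length t: K plays the role of the first-layer
    # weights, V the second-layer weights, and ReLU replaces softmax as
    # the input-conditioned selection (an assumption for illustration).
    hidden = np.maximum(K @ q_t / np.sqrt(K.shape[-1]), 0.0)  # [t] hidden units
    return hidden @ V                                          # [d] output

# Tiny usage example with made-up dimensions.
rng = np.random.default_rng(0)
d, t = 8, 5                      # head dimension, current sequence length
K = rng.standard_normal((t, d))  # cached keys  ~ first-layer weights
V = rng.standard_normal((t, d))  # cached values ~ second-layer weights
q_t = rng.standard_normal(d)     # current query

print(attention_step(q_t, K, V))        # softmax-attention output
print(hypermlp_style_step(q_t, K, V))   # MLP-style output
```

The point of the sketch is that nothing structural changes between the two functions: the key/value cache already defines a two-layer network that grows by one hidden unit per token, and the only design question is which nonlinearity gates that hidden layer.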
Why It Matters
If these results hold at scale, replacing softmax attention with MLP-style activations could yield simpler, more efficient sequence models that match or exceed today's transformer baselines.