Research & Papers

New proof shows Transformers achieve near-optimal approximation with just 2 encoder blocks

Shallow Transformers can approximate complex functions using only two encoder blocks, a new paper proves.

Deep Dive

A new theoretical paper by Zhongjie Shi and Wenjing Liao (arXiv:2605.08811) provides rigorous approximation and generalization guarantees for Transformer networks on regression tasks over compact Euclidean domains and Riemannian manifolds. The authors propose a constructive framework where the attention mechanism builds local approximations of the target function via affine transformations of the input, then aggregates them into a global output using a softmax-based partition of unity. This mimics how Transformers spatially localize information. The architecture studied is dense, shallow (only two encoder blocks), and wide, with sinusoidal positional encodings and standard single-hidden-layer point-wise feed-forward networks, closely matching real-world implementations.
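
The softmax partition-of-unity idea can be illustrated with a small one-dimensional toy. This is a simplified sketch, not the paper's construction: the function name `softmax_partition_approx`, the grid of `centers`, the `temperature` parameter, and the test function `target` are all assumptions made for illustration, and each local approximation is just the constant value of the target at a center. The scores are affine in the input x, and the softmax turns them into nonnegative weights that sum to one and concentrate on nearby centers.

```python
import numpy as np


def softmax_partition_approx(f, x, centers, temperature):
    """Blend local values f(c_k) with softmax weights.

    Returns (approximation at x, weight matrix of shape [len(x), len(centers)]).
    """
    # Scores affine in x: (2*c*x - c^2) / temperature. This differs from
    # -(x - c)^2 / temperature only by a term that is the same for every
    # center, and such a term cancels inside the softmax.
    scores = (2.0 * x[:, None] * centers[None, :] - centers[None, :] ** 2) / temperature
    scores = scores - scores.max(axis=1, keepdims=True)  # stabilise the exponentials
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    local_values = f(centers)  # constant local approximation at each center
    return weights @ local_values, weights


def target(t):
    # Hypothetical test function: Hoelder continuous with alpha = 1/2 on [0, 1].
    return np.sqrt(np.abs(t - 0.5))


if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 1001)
    for k in (8, 64, 512):
        centers = np.linspace(0.0, 1.0, k)
        spacing = centers[1] - centers[0]
        # Assumed choice: temperature tied to the squared grid spacing.
        approx, _ = softmax_partition_approx(target, x, centers, spacing ** 2)
        print(k, float(np.max(np.abs(approx - target(x)))))
```

Tying the temperature to the squared grid spacing keeps each weight profile about one cell wide, so the uniform error shrinks as centers are added. In this toy the number of centers plays the role that width plays in the architecture described above.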

From an approximation perspective, the paper proves that for α-Hölder continuous functions (α in (0,1]), the model achieves uniform ε-approximation error using O(ε^{-d/α}) total parameters. For fixed α and ε < 1, this count grows exponentially in the dimension d, so the efficiency claim concerns depth rather than dimension: two encoder blocks suffice, provided the network is wide enough. Building on this, the authors derive a near minimax-optimal generalization error bound of order O(n^{-2α/(2α+d)} log n) for the empirical risk minimizer, where n is the training sample size. This bound matches the minimax lower bound up to a logarithmic factor, which means Transformers of this architecture are statistically near-optimal for learning Hölder continuous functions. The work offers a theoretical account of why shallow Transformers can be effective and may inform the design of parameter-efficient models.
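
To make the two rates concrete, the following sketch evaluates the quoted orders with all constants ignored. The helper names `parameter_order`, `rate_exponent`, and `generalization_order` are hypothetical, and the numbers indicate scaling only, not actual parameter counts or errors from the paper.

```python
import math


def parameter_order(eps, d, alpha):
    """Order of the parameter count, eps^(-d/alpha), with constants dropped."""
    return eps ** (-d / alpha)


def rate_exponent(d, alpha):
    """Exponent 2*alpha / (2*alpha + d) in the generalization bound."""
    return 2.0 * alpha / (2.0 * alpha + d)


def generalization_order(n, d, alpha):
    """Order of the bound, n^(-2*alpha/(2*alpha+d)) * log(n), constants dropped."""
    return n ** (-rate_exponent(d, alpha)) * math.log(n)


if __name__ == "__main__":
    # Scaling of the parameter count at a fixed target accuracy.
    for d in (1, 2, 4, 8):
        print("d =", d, "parameter order:", parameter_order(0.1, d, 1.0))
    # Scaling of the generalization bound with sample size.
    for n in (10**3, 10**5, 10**7):
        print("n =", n, "bound order:", generalization_order(n, 2, 1.0))
```

At fixed ε and α, each additional dimension multiplies the parameter order by ε^{-1/α}, and a larger d also lowers the exponent in the generalization bound, so more samples are needed to reach the same error.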

Key Points
  • Only two encoder blocks and single-hidden-layer FFNs needed for uniform ε-approximation of α-Hölder functions
  • Parameter count scales as O(ε^{-d/α}), which grows exponentially in the dimension d for fixed α and ε
  • Generalization error bound is near minimax-optimal: O(n^{-2α/(2α+d)} log n) for training size n

Why It Matters

The result provides a theoretical basis for the efficiency of shallow Transformers and could guide the design of smaller, faster models for real-world applications.