Structured Multidimensional Representation Learning for Large Language Models
New 'L-Transformer' architecture uses tensor math to cut encoder parameter counts while preserving performance.
A team of researchers has proposed a novel 'Tensor Transformer' architecture that fundamentally restructures how large language models (LLMs) represent and process information. The core innovation is a structured spectral factorization of the embedding space using the L-product for third-order tensors. Instead of standard dense embeddings, token representations are reshaped into spectral tensor slices, and key operations like attention and feed-forward layers are performed in this transformed domain. This creates an architecture, dubbed the L-Transformer, that is mathematically proven to be spectrally equivalent to running multiple parallel, smaller Transformers. The result is a dramatic compression of the model's encoder component.
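To make the mechanism concrete, the sketch below shows one way such a layer could look in PyTorch, assuming the L-product is instantiated with a fixed orthonormal DCT across slices: a width-d embedding is reshaped into p slices of width d/p, the transform mixes information along the slice axis, and a small attention module runs on each spectral slice in parallel. The names (SpectralSliceAttention, dct_matrix) and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn

def dct_matrix(p: int) -> torch.Tensor:
    """Orthonormal DCT-II matrix of size p x p (fixed, not learned)."""
    k = torch.arange(p, dtype=torch.float32).unsqueeze(1)  # frequency index
    n = torch.arange(p, dtype=torch.float32).unsqueeze(0)  # slice index
    M = torch.cos(math.pi / p * (n + 0.5) * k)
    M[0] *= 1.0 / math.sqrt(2.0)
    return M * math.sqrt(2.0 / p)

class SpectralSliceAttention(nn.Module):
    """Hypothetical layer: attention applied independently to p spectral slices."""
    def __init__(self, d_model: int, p: int, n_heads: int = 4):
        super().__init__()
        assert d_model % p == 0
        self.p, self.d_slice = p, d_model // p
        self.register_buffer("L", dct_matrix(p))  # transform along the slice axis
        # one small attention module per slice -> "parallel sub-transformers"
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(self.d_slice, n_heads, batch_first=True)
            for _ in range(p)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape                                     # x: (batch, seq, d_model)
        slices = x.view(b, s, self.p, self.d_slice)           # (b, s, p, d/p)
        spec = torch.einsum("kp,bspe->bske", self.L, slices)  # move to spectral domain
        out = torch.stack(
            [self.attn[k](spec[:, :, k], spec[:, :, k], spec[:, :, k])[0]
             for k in range(self.p)],
            dim=2,
        )
        back = torch.einsum("kp,bske->bspe", self.L, out)     # inverse (L is orthonormal)
        return back.reshape(b, s, self.p * self.d_slice)
```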
In experiments, the tensorized encoder delivered on this efficiency claim. On the IMDB sentiment analysis dataset, it matched or even improved upon the accuracy of a standard Transformer baseline while using far fewer parameters. For a configuration with p=4 spectral slices, encoder parameters were reduced by approximately 75%. On the AG News topic classification task, a model with moderate width traded a small amount of accuracy for a 4x reduction in encoder size. Crucially, at BERT-base width (d=768), performance returned to parity with the standard model, supporting the technique's viability at realistic scale. Beyond compression, the spectral decomposition introduces a useful inductive bias over embedding frequencies, allowing slice-dependent frequency scaling that can improve generalization.
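A quick back-of-the-envelope calculation shows where a figure like 75% can come from (the paper's exact accounting may differ): a dense weight matrix in a width-d encoder costs d^2 parameters, while p parallel slices of width d/p cost p·(d/p)^2 = d^2/p, i.e., 1/p of the original, which is a 75% reduction at p=4.

```python
# Illustrative parameter count for one dense projection; the paper's exact
# accounting may differ, but the scaling matches the reported ~75% at p=4.
d, p = 768, 4
dense = d * d                  # one d x d projection in a standard encoder
sliced = p * (d // p) ** 2     # the same projection split across p spectral slices
print(dense, sliced, 1 - sliced / dense)   # 589824 147456 0.75
```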
The approach is fully differentiable and compatible with existing training pipelines when instantiated with a real-valued Discrete Cosine Transform (DCT). This work, detailed in the arXiv preprint 'Structured Multidimensional Representation Learning for Large Language Models,' addresses the critical scaling issue of Transformer models, where performance gains are accompanied by unsustainable parameter growth and redundancy. It opens a new pathway for building more parameter-efficient LLMs without sacrificing capability.
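Because the DCT is a fixed orthonormal matrix, gradients flow through it like any other linear map, which is what lets the approach drop into standard training loops. The short check below assumes the illustrative SpectralSliceAttention sketch defined above; shapes and values are arbitrary.

```python
import torch

# Assumes the illustrative SpectralSliceAttention class from the earlier sketch.
layer = SpectralSliceAttention(d_model=128, p=4)
x = torch.randn(2, 16, 128, requires_grad=True)   # (batch, seq, d_model)
loss = layer(x).pow(2).mean()                     # stand-in for a real training loss
loss.backward()                                   # gradients pass through the fixed DCT
print(x.grad.shape)                               # torch.Size([2, 16, 128])
```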
- Uses L-product tensor factorization to reshape embeddings into spectral slices, enabling parallel sub-transformers.
- Achieves up to 75% reduction in encoder parameters (for p=4) while maintaining competitive accuracy on IMDB and AG News.
- Fully differentiable and training-compatible, introducing a frequency-based inductive bias that can improve generalization.
Why It Matters
This technique could enable more powerful and efficient LLMs, reducing computational costs and environmental impact for training and deployment.