Research & Papers

Learning to Recall with Transformers Beyond Orthogonal Embeddings

A new theoretical analysis derives the 'storage capacity' of transformers trained under realistic conditions: finite data and random, non-orthogonal embeddings.

Deep Dive

A team of researchers from Stanford, NYU, and USC has published a significant theoretical paper, 'Learning to Recall with Transformers Beyond Orthogonal Embeddings,' accepted at ICLR 2026. The work tackles a major gap in understanding how large language models (LLMs) like GPT-4 and Claude 3 actually store and retrieve knowledge. Prior theoretical analyses often relied on unrealistic, idealized assumptions such as infinite data or perfectly orthogonal token embeddings. This new research instead analyzes a transformer trained by gradient descent on the empirical loss over finite data, with random, non-orthogonal embeddings, which is a much closer match to real-world training.

By studying a simplified token-retrieval task, the researchers derived explicit formulas for a model's 'storage capacity.' Their core finding is a multiplicative scaling law: the number of examples a model can reliably learn (N) scales with the product of the embedding dimension (d) and the sequence length (L), expressed as N ~ d * L. This means a model's ability to memorize and recall information isn't just about dataset size or model width alone, but a specific interplay between these architectural and data parameters. The team validated this scaling numerically and provided a statistical lower bound, showing it's an intrinsic property under realistic conditions.
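To make the capacity notion concrete, the sketch below builds a toy linear (outer-product) associative memory from random, non-orthogonal key and value embeddings and measures how recall quality degrades as the number of stored pairs N grows relative to the embedding dimension d. This is only an illustrative proxy for the kind of capacity experiment described above, not the paper's transformer construction or training setup, and it probes only the d-dependence; the paper's law additionally accounts for the sequence length L.

    import numpy as np

    # Toy check: recall quality of a linear associative memory built from
    # random, non-orthogonal embeddings. The cosine similarity between the
    # retrieved vector and the true value is governed by N / d, roughly
    # 1 / sqrt(1 + N/d), so recall degrades once N exceeds order d.
    rng = np.random.default_rng(0)

    def mean_recall_similarity(d, n_pairs):
        # Random unit-norm keys and values (non-orthogonal for n_pairs > 1).
        keys = rng.standard_normal((n_pairs, d))
        keys /= np.linalg.norm(keys, axis=1, keepdims=True)
        values = rng.standard_normal((n_pairs, d))
        values /= np.linalg.norm(values, axis=1, keepdims=True)

        # Hebbian storage: W = sum_i v_i k_i^T
        W = values.T @ keys

        # Recall every stored key and compare to its true value.
        retrieved = keys @ W.T
        retrieved /= np.linalg.norm(retrieved, axis=1, keepdims=True)
        return float(np.mean(np.sum(retrieved * values, axis=1)))

    for d in (64, 256, 1024):
        for ratio in (0.25, 1.0, 4.0):
            n = max(1, int(ratio * d))
            sim = mean_recall_similarity(d, n)
            print(f"d={d:5d}  N={n:5d}  N/d={ratio:>4}  mean cosine similarity={sim:.2f}")

Running the sweep shows the similarity collapsing as N/d grows, roughly independently of d itself, which is the flavor of capacity limit the paper formalizes.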

This work moves AI theory closer to engineering practice. For developers training models, it offers a more grounded mathematical framework for predicting how model scale (embedding size, context length) interacts with dataset size to achieve reliable factual recall. It helps explain the fundamental mechanisms behind the 'knowledge' in models powering search and question-answering, providing a crucial step toward more predictable and efficient LLM development beyond trial and error.
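As a purely hypothetical illustration of how such a framework could be applied (the proportionality constant and every number below are made-up placeholders, not values from the paper), one could calibrate the constant in N ~ c * d * L on a small run and then extrapolate to larger configurations:

    # Hypothetical use of the N ~ d * L rule of thumb. The constant c and
    # all numbers are illustrative placeholders, not results from the paper.
    def predicted_capacity(d, L, c):
        """Predicted number of reliably recallable examples, N ~ c * d * L."""
        return c * d * L

    # Suppose a small run with d=512 and context length L=1024 reliably
    # recalled about 2.6 million facts (made-up calibration point).
    c = 2_600_000 / (512 * 1024)

    for d, L in [(1024, 2048), (2048, 4096), (4096, 8192)]:
        print(f"d={d}, L={L}: predicted capacity ~ {predicted_capacity(d, L, c):,.0f}")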

Key Points
  • The analysis moves beyond idealized 'orthogonal embeddings' to study transformers trained on finite data with random, non-orthogonal embeddings, a setting much closer to real-world conditions.
  • It reveals a multiplicative scaling law for storage capacity: sample size (N) scales with embedding dimension (d) times sequence length (L), or N ~ d * L.
  • The findings provide a theoretical foundation for how modern LLMs encode and retrieve facts, impacting how we scale models for knowledge-intensive tasks.

Why It Matters

Provides a mathematical blueprint for how real LLMs store knowledge, guiding more efficient model scaling and architecture design for better factual recall.