AI Safety

How did ‘large’ language models get that way? The role of Transformers and Pretraining in GPT

Self-supervised learning on internet-scale data is the real secret behind LLM size.

Deep Dive

In his LessWrong essay, Oliver Sourbut traces how large language models (LLMs) like GPT grew to their enormous scale. He argues the answer lies in the 'P' (pretrained) and 'T' (transformer) of GPT. Self-supervised learning, in which a model is trained to predict the next token in vast amounts of unlabeled internet text, provides the bulk of the 'cake' in Yann LeCun's famous analogy. This approach sidesteps the expense of curated, human-labeled datasets and forces the model to learn generalizable linguistic concepts. The Transformer architecture, with its attention mechanism, efficiently handles long-range dependencies in sequences, enabling models to process entire paragraphs or documents as context. This combination is what allowed model size to grow from millions to billions of parameters.
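
The mechanics are compact enough to show directly. Below is a minimal sketch (assuming PyTorch) of the two ideas in the paragraph above: single-head scaled dot-product attention with a causal mask, and the self-supervised next-token objective, in which the labels are simply the input shifted by one position. The layer sizes, weight names, and the random token batch standing in for web text are illustrative assumptions, not details from Sourbut's essay or from GPT itself.

    # Minimal sketch of causal self-attention plus the self-supervised
    # next-token objective. Dimensions and the toy "dataset" are illustrative.
    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (batch, seq_len, d_model) token embeddings.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq, seq)
        # Causal mask: each position attends only to earlier positions,
        # which is what makes next-token prediction a valid training signal.
        mask = torch.ones(scores.shape[-2:]).triu(1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v                 # (batch, seq, d_k)

    vocab_size, d_model, seq_len = 100, 32, 16
    tokens = torch.randint(0, vocab_size, (4, seq_len))      # stand-in for web text
    embed = torch.nn.Embedding(vocab_size, d_model)
    unembed = torch.nn.Linear(d_model, vocab_size)
    w_q = torch.randn(d_model, d_model) * 0.02
    w_k = torch.randn(d_model, d_model) * 0.02
    w_v = torch.randn(d_model, d_model) * 0.02

    hidden = self_attention(embed(tokens), w_q, w_k, w_v)    # (4, seq_len, d_model)
    logits = unembed(hidden)                                 # (4, seq_len, vocab)
    # Self-supervised objective: predict token t+1 from positions 0..t.
    # The "labels" come from the text itself; no human annotation is needed.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
    )
    print(f"pretraining loss: {loss.item():.3f}")

Roughly speaking, everything a production model adds (multiple heads, stacked layers, positional information, an optimizer loop over trillions of real tokens) scales this template up rather than changing it; and because any unlabeled text supplies its own targets, the recipe can consume as much data and compute as can be thrown at it.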

Sourbut notes that while this self-supervised pretraining was the foundation, the paradigm has evolved by 2025–2026, with reinforcement learning and supervised fine-tuning playing larger roles (the 'icing' and 'cherry'). The essay is part of a series examining implications for AI reasoning and explainability, hinting that current trends may sacrifice some of the transparency that emerged from this training regime. For professionals, understanding this history clarifies why LLMs are so data-hungry and compute-intensive, and why future shifts in training methodology could change their capabilities and interpretability.

Key Points
  • Self-supervised learning (predicting the next token) on internet-scale data provided the 'bulk of the cake' for LLMs, avoiding expensive human labeling.
  • The Transformer architecture's attention mechanism made it practical to handle long sequences, which is critical for language coherence and for scaling to billions of parameters.
  • LeCun's 2016 cake analogy (unsupervised learning as the bulk of the cake, supervised learning as the icing, RL as the cherry) captured early LLM training well, but by 2025–2026 RL and supervised fine-tuning have become more central.

Why It Matters

Explains the core drivers behind LLM scaling—critical for understanding AI costs, data needs, and future model shifts.