Research & Papers

[D] Is language modeling fundamentally token-level or sequence-level?

A viral Reddit discussion questions whether next-token prediction is a fundamentally flawed objective for training AI.

Deep Dive

A technical debate is going viral on Reddit's Machine Learning community, questioning a foundational assumption in AI: is language modeling about predicting the next token, or about generating coherent sequences? User 36845277's post highlights a core tension. During pretraining, models like GPT-4 and Llama 3 are trained with a token-level cross-entropy loss, which averages -log P(next token | previous tokens) over all tokens in the sequence. During the alignment phase, however—whether Reinforcement Learning from Human Feedback (RLHF) or methods like GRPO—rewards are assigned to entire sequences, not to individual tokens. This creates a fundamental mismatch in how the model learns.
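The token-level objective described above can be sketched in a few lines. This is a toy illustration of the formula from the post, not code from it; `probs` is a hypothetical list of the probability the model assigned to the actual next token at each position.

```python
import math

def token_level_loss(probs):
    """Cross-entropy as described: sum of -log P(next token | previous
    tokens), divided by the total number of tokens."""
    return sum(-math.log(p) for p in probs) / len(probs)

# Hypothetical example: four positions, with the model's probability
# for the ground-truth next token at each one.
loss = token_level_loss([0.9, 0.5, 0.25, 0.8])
```

Note that every token contributes its own independent term: the loss never looks at the quality of the sequence as a whole, which is exactly the property the discussion questions.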

This discrepancy has real-world implications. The post suggests that problems like repetitive outputs might stem from models being optimized for myopic, token-by-token prediction rather than holistic sequence quality. It cites research such as "Long Horizon Temperature Scaling" (Shih et al., 2023), which argues that standard token-level temperature scaling during sampling is short-sighted and proposes more sequence-aware generation. The discussion seeks to unify these perspectives, asking whether a sequence-level view during pretraining could yield better base models, and what the most principled framing of language modeling truly is.

Key Points
  • Core Tension Identified: Pretraining uses token-level loss (sum of -log P(next token)), while alignment (RLHF/GRPO) uses sequence-level rewards.
  • Potential Flaw: This mismatch may cause issues such as repetitive output, since models are never trained to optimize full-sequence coherence.
  • Research Cited: Work like "Long Horizon Temperature Scaling" (Shih et al., 2023) attempts to correct "myopic" token-level sampling.
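The mismatch in the first key point can be seen by comparing the learning signal each regime sends to individual tokens. This is a hedged, simplified sketch (assuming a REINFORCE-style policy gradient, which GRPO builds on), not the exact update rule of any particular trainer.

```python
def pretraining_signal(token_logprobs):
    """Pretraining: every token gets its own target, so each position
    receives an independent per-token loss of -log P."""
    return [-lp for lp in token_logprobs]

def alignment_signal(token_logprobs, sequence_reward):
    """Alignment: one scalar reward judges the whole sequence, and
    policy-gradient methods broadcast that same scalar to every
    token's log-probability term."""
    return [sequence_reward * lp for lp in token_logprobs]
```

In the first function, credit assignment is per-token and dense; in the second, every token shares one sequence-level verdict. That structural difference between the two training signals is the tension the thread identifies.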

Why It Matters

Challenging this core assumption could lead to more coherent AI models and new, more principled training methodologies.