Research & Papers

EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

Simple EMA traces achieve 96% of BiGRU performance on grammar tasks but fail at language modeling.

Deep Dive

Researcher Arth Singh's paper uses exponential moving average (EMA) traces as a minimalist probe of what simple recurrent models can and cannot learn. EMA traces are the simplest possible recurrent context: no gating mechanisms, no content-based retrieval, just fixed-coefficient accumulation of information over time. The study shows that these traces excel at capturing temporal structure: a multi-timescale EMA architecture achieved 96% of a supervised BiGRU's performance on grammatical role assignment, even surpassing it on structure-dependent roles, without any labeled training data.
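To make the mechanism concrete, here is a minimal NumPy sketch of a multi-timescale EMA context. The decay coefficients, embedding dimension, and the choice to concatenate the traces are illustrative assumptions rather than the paper's exact configuration; the point is that every update uses the same fixed coefficients regardless of the input.

    import numpy as np

    def multi_timescale_ema(embeddings, alphas=(0.5, 0.9, 0.99)):
        """Accumulate token embeddings into one EMA trace per decay coefficient.

        embeddings: (T, d) array of token embeddings.
        Returns a (T, len(alphas) * d) array: the concatenated traces at each step.
        """
        T, d = embeddings.shape
        traces = np.zeros((len(alphas), d))
        context = np.empty((T, len(alphas) * d))
        for t, x in enumerate(embeddings):
            for k, alpha in enumerate(alphas):
                # Fixed-coefficient update: no gates, no content-based retrieval.
                traces[k] = alpha * traces[k] + (1.0 - alpha) * x
            context[t] = traces.reshape(-1)
        return context

    # Example: 12 tokens with 8-dimensional embeddings -> (12, 24) context matrix.
    rng = np.random.default_rng(0)
    print(multi_timescale_ema(rng.normal(size=(12, 8))).shape)
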

However, the research exposes EMA's critical weakness: it destroys token identity through lossy, data-independent compression. A 130M-parameter language model using only EMA context reached a C4 perplexity of 260—eight times worse than GPT-2's performance. Crucially, replacing the model's linear predictor with full softmax attention yielded identical loss, proving the entire performance gap originates in the EMA traces themselves. By the data processing inequality, no downstream predictor can recover the information EMA discards. The paper concludes that fixed-coefficient accumulation, whether across time or network depth, suffers irreversible information dilution that only learned, input-dependent selection mechanisms (like those in transformers or gated RNNs) can resolve.
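The dilution argument can be seen in a few lines. In the toy sketch below (a single assumed decay coefficient and random embeddings, not the paper's setup), a token observed t steps ago contributes to the trace with weight (1 - alpha) * alpha^t no matter what its content is, so two prefixes that differ only in an early token end up with nearly indistinguishable traces.

    import numpy as np

    rng = np.random.default_rng(1)
    alpha, d, gap_steps = 0.9, 8, 40

    def ema(seq, alpha):
        # Single-timescale EMA over a (T, d) sequence of embeddings.
        trace = np.zeros(seq.shape[1])
        for x in seq:
            trace = alpha * trace + (1.0 - alpha) * x
        return trace

    # Two sequences that differ only in their first token, then share 40 tokens.
    first_a, first_b = rng.normal(size=(2, d))
    shared = rng.normal(size=(gap_steps, d))
    seq_a = np.vstack([first_a[None, :], shared])
    seq_b = np.vstack([first_b[None, :], shared])

    # The first token's weight in the final trace is (1 - alpha) * alpha**gap_steps,
    # independent of what that token was, so the traces become nearly identical.
    diff = np.linalg.norm(ema(seq_a, alpha) - ema(seq_b, alpha))
    predicted = (1.0 - alpha) * alpha**gap_steps * np.linalg.norm(first_a - first_b)
    print(f"trace difference: {diff:.2e}  (predicted {predicted:.2e})")

An attention readout over the raw tokens can still retrieve the early token because its keys and values are never compressed; the EMA trace has already discarded that information before any downstream predictor sees the context, which is the data-processing-inequality point above.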

Key Points
  • EMA traces achieve 96% of supervised BiGRU performance on grammatical role assignment with zero labels
  • A 130M-param LM using only EMA context reaches C4 perplexity 260 (8x worse than GPT-2)
  • The entire performance gap localizes to EMA traces, proving they cause irreversible information loss

Why It Matters

Clarifies the fundamental limits of simple recurrent architectures, showing where learned, input-dependent selection is needed and guiding the design of more efficient sequence models.