Research & Papers

Tensor Cache boosts long-context LLMs with eviction-based associative memory

New two-level cache stores evicted tokens as compact matrix, slashing memory while preserving accuracy

Deep Dive

Autoregressive Transformer models suffer from KV caches that grow linearly with context length. Sliding-window caching bounds memory but discards evicted tokens entirely, so distant evidence becomes inaccessible. A new paper from Kabir Swain (MIT) and collaborators at IBM Research proposes Tensor Cache, a two-level cache that pairs sliding-window softmax attention (L1) with a fixed-size outer-product fast-weight memory (L2) fed by the KV pairs evicted from the window. Recent tokens stay in exact local attention; evicted pairs are compressed into a per-layer matrix A, which future queries read with a single matrix multiplication. This works because of the linear-attention identity: multiplying a query by a sum of key-value outer products gives the same result as summing the values, each weighted by its key's dot product with the query. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end, giving fine-grained control over memory retention.
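The two-level design can be illustrated with a toy single-head cache in NumPy. This is only a sketch under stated assumptions, not the paper's implementation: the class name `ToyTwoLevelCache`, the scalar `decay` and `write_rate`, the fixed `gate` value, and the convex-mix fusion rule are all hypothetical stand-ins for the learned parameters the summary describes.

```python
import numpy as np
from collections import deque


class ToyTwoLevelCache:
    """Illustrative single-head cache: exact window (L1) plus matrix memory (L2)."""

    def __init__(self, d_k, d_v, window, decay=1.0, write_rate=1.0, gate=0.5):
        self.window = window
        self.decay = decay              # assumption: scalar decay applied to A on each write
        self.write_rate = write_rate    # assumption: scalar scale on each written outer product
        self.gate = gate                # assumption: fixed mixing weight in [0, 1]
        self.l1 = deque()               # recent (k, v) pairs, kept exactly
        self.A = np.zeros((d_v, d_k))   # fixed-size L2 memory

    def append(self, k, v):
        self.l1.append((k, v))
        if len(self.l1) > self.window:
            k_old, v_old = self.l1.popleft()
            # The evicted pair is folded into A as an outer product instead of being dropped.
            self.A = self.decay * self.A + self.write_rate * np.outer(v_old, k_old)

    def read(self, q):
        keys = np.stack([k for k, _ in self.l1])
        vals = np.stack([v for _, v in self.l1])
        scores = keys @ q / np.sqrt(q.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        l1_out = weights @ vals         # softmax attention over the window
        l2_out = self.A @ q             # one matmul reads every evicted pair
        fused = self.gate * l1_out + (1.0 - self.gate) * l2_out
        return fused, l1_out, l2_out


rng = np.random.default_rng(0)
d_k, d_v, window, n = 8, 4, 3, 10
keys = rng.normal(size=(n, d_k))
vals = rng.normal(size=(n, d_v))

cache = ToyTwoLevelCache(d_k, d_v, window)
for k, v in zip(keys, vals):
    cache.append(k, v)

q = rng.normal(size=d_k)
fused, l1_out, l2_out = cache.read(q)

# Linear-attention identity: with decay = write_rate = 1, reading A equals
# an explicit sum over the evicted tokens of (k_i . q) * v_i.
evicted = n - window
explicit = sum((keys[i] @ q) * vals[i] for i in range(evicted))
assert np.allclose(l2_out, explicit)
```

The memory cost of L2 is the `d_v × d_k` matrix `A` regardless of how many tokens have been evicted, which is the property that keeps the cache bounded.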

The paper also identifies and fixes a subtle bug in prior outer-product memory training: the common chunked-mean approximation silently introduces C² − C spurious cross-token outer products per chunk of C tokens. The authors close this gap with a parallel weighted-sum scan that is equivalent to per-token writes within float32 epsilon. Across evaluations covering systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache consistently improves the memory–quality frontier over bounded-state baselines. The result is a practical way to extend Transformer context windows without unbounded memory growth, useful for applications such as long-document analysis and conversational AI that must retain information from far back in the input.
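The cross-term count can be checked numerically on a single chunk. The sketch below is a hedged illustration, not the authors' code: it assumes the chunked-mean shortcut writes one outer product of the chunk's mean value and mean key (scaled by C), and it assumes scalar `decay` and `write_rate` values. The weighted sum shown here covers one chunk only, whereas the paper describes a scan.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_k, d_v = 4, 8, 4            # chunk length and head dimensions (illustrative)
decay, write_rate = 0.9, 0.5     # assumption: scalar per-head parameters
K = rng.normal(size=(C, d_k))
V = rng.normal(size=(C, d_v))
A0 = rng.normal(size=(d_v, d_k))

# Reference: one decayed write per token.
A_seq = A0.copy()
for k, v in zip(K, V):
    A_seq = decay * A_seq + write_rate * np.outer(v, k)

# Parallel weighted sum over the chunk: token t is decayed C-1-t more times
# after it is written, so its outer product gets weight decay**(C-1-t).
w = decay ** np.arange(C - 1, -1, -1)
A_par = decay**C * A0 + write_rate * (V * w[:, None]).T @ K

# Assumed chunked-mean shortcut: a single write of mean(v) mean(k)^T, scaled by C.
A_mean = decay**C * A0 + write_rate * C * np.outer(V.mean(axis=0), K.mean(axis=0))

# A product of sums expands into C*C outer products v_i k_j^T;
# only the C terms with i == j correspond to real per-token writes.
cross_terms = [(i, j) for i in range(C) for j in range(C) if i != j]

print(np.max(np.abs(A_par - A_seq)))    # rounding error only
print(np.max(np.abs(A_mean - A_seq)))   # visibly nonzero
print(len(cross_terms))                 # C*C - C = 12 for C = 4
```

The weighted sum reproduces the sequential result because each token's outer product enters `A` exactly once with the correct decay factor, while the mean-based write pairs every value with every key in the chunk.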

Key Points
  • Two-level cache: L1 sliding-window attention + L2 outer-product fast-weight memory fed by evicted KV pairs, read via a single matrix multiply
  • A learned scalar gate fuses the L1 and L2 outputs, while learned per-head decay and write-rate parameters give fine-grained control over memory retention
  • Fixes a common training shortcut, the chunked-mean approximation, which introduced C² − C spurious cross-token outer products per chunk, replacing it with a parallel weighted-sum scan equivalent to per-token writes

Why It Matters

Enables LLMs to retain distant context without memory that grows with context length, which matters for long-document analysis and long-running conversations.