Research & Papers

Why Attend to Everything? Focus is the Key

New 'Focus' technique retrofits existing LLMs to run up to 8.6x faster while improving perplexity, with no changes to model weights.

Deep Dive

A research team including Hengshuai Yao, Xing Chen, and eight others has introduced Focus, a method that makes large language models dramatically more efficient without modifying their core weights. Where approximate-attention methods try to preserve all token interactions, Focus learns which token pairs actually matter: learnable centroids assign tokens to groups, distant attention is restricted to same-group pairs, and local attention keeps full resolution. Only 148K additional parameters are trained, regardless of model size. The method is a pure retrofit, with all existing model weights frozen, yet it improves domain perplexity with zero degradation on downstream benchmarks, across five attention architectures and at model scales from 124M to 70B parameters.
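
The core mechanism is easy to sketch. Below is a minimal PyTorch illustration of the idea as described above, not the authors' implementation: names such as group_assign, focus_mask, num_groups, and window are assumptions for this sketch, and a real system would apply the mask per head inside the frozen attention layers.

```python
import torch
import torch.nn.functional as F

def group_assign(hidden, centroids):
    # hidden:    (seq_len, d_model) hidden states from the frozen model
    # centroids: (num_groups, d_model) learnable, the only new parameters
    return hidden @ centroids.T  # (seq_len, num_groups) similarity scores

def focus_mask(scores, window):
    # Permit attention where tokens fall within the local window
    # or share an argmax group; all other distant pairs are masked.
    seq_len = scores.shape[0]
    groups = scores.argmax(dim=-1)                        # hard group id per token
    idx = torch.arange(seq_len)
    local = (idx[:, None] - idx[None, :]).abs() <= window
    same_group = groups[:, None] == groups[None, :]
    causal = idx[:, None] >= idx[None, :]
    return (local | same_group) & causal

# Toy usage with random tensors standing in for real activations.
seq_len, d_model, num_groups = 512, 64, 16
hidden = torch.randn(seq_len, d_model)
centroids = torch.randn(num_groups, d_model, requires_grad=True)
q, k = torch.randn(seq_len, d_model), torch.randn(seq_len, d_model)
mask = focus_mask(group_assign(hidden, centroids), window=128)
logits = (q @ k.T) / d_model ** 0.5
attn = F.softmax(logits.masked_fill(~mask, float("-inf")), dim=-1)
```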

At inference time, Focus delivers substantial speedups by restricting each token to its top-k highest-scoring groups, a hard sparsity pattern that yields a 2x speedup while beating pretrained baselines on perplexity (41.3 vs. 42.8 PPL). Remarkably, the pattern decomposes into two standard FlashAttention calls, reaching an 8.6x wall-clock speedup at 1M tokens without any custom kernels. Unlike parameter-efficient fine-tuning methods such as LoRA, Focus also preserves model alignment: instruction-tuned models retain their TruthfulQA scores after adaptation, whereas LoRA degrades performance at every learning rate and rank tested. Finally, Sinkhorn normalization enforces balanced groups as a hard constraint, and the resulting groups discover interpretable linguistic categories without any supervision.
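
The two-call decomposition hinges on a standard trick: FlashAttention-style kernels expose both the attention output and each query's log-sum-exp, so two passes over disjoint key sets (say, a local window and the selected distant groups) can be merged exactly, as if one softmax had covered their union. A hedged sketch in plain PyTorch, with masked_attention standing in for a real FlashAttention call:

```python
import torch

def masked_attention(q, k, v, mask):
    # Stand-in for one FlashAttention call: returns the attention
    # output plus the per-query log-sum-exp of the logits.
    # Assumes every query keeps at least one key (the causal
    # self-pair guarantees this for the local mask).
    logits = (q @ k.T) / q.shape[-1] ** 0.5
    logits = logits.masked_fill(~mask, float("-inf"))
    lse = torch.logsumexp(logits, dim=-1)        # (seq_len,)
    out = torch.softmax(logits, dim=-1) @ v      # (seq_len, d)
    return out, lse

def merge(out_a, lse_a, out_b, lse_b):
    # Exact combination of two passes, equivalent to one softmax over
    # the union of the key sets, provided the two masks are disjoint
    # (e.g., the group mask excludes keys already in the local window).
    m = torch.maximum(lse_a, lse_b)
    w_a = torch.exp(lse_a - m)[:, None]
    w_b = torch.exp(lse_b - m)[:, None]
    return (w_a * out_a + w_b * out_b) / (w_a + w_b)
```

The balanced-grouping constraint can be sketched the same way, with a few Sinkhorn iterations over the token-to-group scores; the paper's exact schedule is not given here, so treat this sinkhorn as illustrative only:

```python
def sinkhorn(scores, n_iters=3):
    log_p = scores
    for _ in range(n_iters):
        # Balance columns: every group receives roughly equal token mass.
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)
        # Normalize rows: each token's group assignment sums to 1.
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)
    return log_p.exp()
```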

Key Points
  • Focus retrofits existing LLMs with just 148K trainable parameters, improving perplexity without weight changes across 124M to 70B parameter models
  • Achieves 8.6x wall-clock speedup at 1M tokens using standard FlashAttention calls, beating full attention on perplexity (13.82 vs 13.89 PPL at 7B scale)
  • Preserves model alignment where LoRA fails: instruction-tuned models retain TruthfulQA scores, while LoRA degrades them at every learning rate and rank tested

Why It Matters

Enables dramatic speed improvements for existing LLMs without costly retraining or performance trade-offs, making large-scale deployment more practical.