Research & Papers

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

A pluggable module that boosts training throughput by up to ~90x on long sequences

Deep Dive

Abhinaba Basu's new paper introduces HubRouter, a pluggable module that replaces traditional O(n²) attention layers with O(nM) hub-mediated routing, where M is a small number of learned hub tokens (M << n). The module implements a four-stage encode-decode-score-council pipeline: the M learned hubs cross-attend to all n tokens (encode), tokens project against the hub states to form routing fingerprints (decode), a score head selects the top-k tokens (score), and a sparse council attends only to that selected subset (council). Because every stage touches each token only through the M hubs or the top-k subset, per-layer cost drops from quadratic to linear in sequence length.
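The pipeline above can be sketched as a single PyTorch module. This is a minimal illustration of the encode-decode-score-council idea, not the paper's implementation: all names (HubRouterSketch, n_hubs, top_k) and design details such as head counts and the fingerprint projection are assumptions.

```python
import torch
import torch.nn as nn

class HubRouterSketch(nn.Module):
    """Illustrative hub-mediated routing layer (names and details hypothetical)."""

    def __init__(self, d_model: int, n_hubs: int = 8, top_k: int = 64):
        super().__init__()
        # M learned hub tokens, shared across the batch
        self.hubs = nn.Parameter(torch.randn(n_hubs, d_model) * 0.02)
        self.encode = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.council = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score = nn.Linear(d_model, 1)  # score head over routing fingerprints
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n, d_model)
        b, n, d = x.shape
        hubs = self.hubs.unsqueeze(0).expand(b, -1, -1)          # (b, M, d)
        # Encode: M hubs cross-attend to all n tokens -> O(nM)
        hub_state, _ = self.encode(hubs, x, x)                   # (b, M, d)
        # Decode: tokens project against hub states for routing fingerprints
        fingerprints = x @ hub_state.transpose(1, 2)             # (b, n, M)
        token_repr = fingerprints @ hub_state                    # (b, n, d)
        # Score: select the top-k tokens per sequence
        scores = self.score(token_repr).squeeze(-1)              # (b, n)
        k = min(self.top_k, n)
        idx = scores.topk(k, dim=-1).indices                     # (b, k)
        selected = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        # Council: every token attends only to the selected subset -> O(nk)
        out, _ = self.council(x, selected, selected)             # (b, n, d)
        return out
```

No score matrix in this sketch is ever larger than n x M or n x k, which is the source of the sub-quadratic cost; note that, unlike Hub-GPT, this simple version is not causally masked.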

HubRouter was validated in three settings. First, Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0) and up to ~90x training throughput at sequence length 1024 against matched PyTorch-native baselines (an optimized baseline would narrow this to ~10-15x). Second, graduated replacement of 25% of a Transformer's attention layers gives the best perplexity in matched-budget sweeps (268.0 vs 282.4 for the pure Transformer). Third, Hub-GPT provides strictly causal routing, achieving PPL 211.5 ± 0.4 over 3 seeds, approximately 3 PPL worse than Jamba's 208.5 ± 0.7. Finally, a multi-seed hub-count sweep (~105 runs across M=1-32) identifies M=8-14 as the reliably converging band, with 4-5 of 5 seeds converging.

Key Points
  • Replaces O(n²) attention with O(nM) routing using M learned hub tokens (M=8-14 recommended)
  • Achieves up to ~90x training throughput at sequence length 1024 in PyTorch-native baselines
  • Hub-Jamba shows a 4.2% PPL improvement; graduated 25% layer replacement beats pure-Transformer perplexity
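The complexity claim behind these numbers is easy to check with back-of-envelope arithmetic, assuming the attention-score matrix dominates per-layer cost (the helper names below are illustrative, not from the paper):

```python
def full_attention_scores(n: int) -> int:
    """Pairwise token-token scores computed by standard O(n^2) attention."""
    return n * n

def hub_routing_scores(n: int, m: int) -> int:
    """Token-hub scores computed by O(nM) hub-mediated routing."""
    return n * m

n = 1024                      # sequence length used in the throughput comparison
for m in (8, 14):             # endpoints of the recommended hub band M=8-14
    ratio = full_attention_scores(n) / hub_routing_scores(n, m)
    print(f"M={m}: {ratio:.0f}x fewer score computations")
```

At n=1024 this gives 128x fewer scores for M=8 and 73x for M=14, the same order of magnitude as the reported ~90x throughput gain; the end-to-end figure also depends on kernel efficiency and the non-attention layers, which this ratio ignores.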

Why It Matters

Makes long-context Transformers practical for production by slashing compute costs while maintaining quality.