HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
A pluggable module that boosts training throughput by up to ~90x on long sequences
Abhinaba Basu's new paper introduces HubRouter, a pluggable module that replaces traditional O(n²) attention layers with O(nM) hub-mediated routing, where M is a small number of learned hub tokens (M << n). The module implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens (encode), each token projects against the hubs to form a routing fingerprint (decode), a score head selects the top-k tokens (score), and a sparse council attends only to that selected subset (council). This design dramatically reduces computational cost on long sequences.
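To make the pipeline concrete, here is a minimal PyTorch sketch under stated assumptions: the class name `HubRouterSketch`, the head count, the default hub count M=12, and the council size `top_k=64` are all illustrative rather than from the paper, and "council attends only to the selected subset" is read here as every token attending over the selected tokens. The sketch is non-causal; the Hub-GPT variant would additionally need causal masking.

```python
import torch
import torch.nn as nn

class HubRouterSketch(nn.Module):
    """Minimal sketch of the encode-decode-score-council pipeline.

    Shapes: n = sequence length, M = number of hubs, d = model dim.
    Encode/decode cost is O(nM) and council cost is O(nk),
    versus O(n^2) for full attention.
    """
    def __init__(self, d_model: int, num_hubs: int = 12, top_k: int = 64, num_heads: int = 4):
        super().__init__()
        self.hubs = nn.Parameter(torch.randn(num_hubs, d_model) * 0.02)  # M learned hub tokens
        self.encode_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.decode_proj = nn.Linear(d_model, d_model)
        self.score_head = nn.Linear(num_hubs, 1)   # per-token score from its hub fingerprint
        self.council_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, n, d = x.shape
        hubs = self.hubs.unsqueeze(0).expand(B, -1, -1)          # (B, M, d)

        # Encode: hubs cross-attend to all tokens -> O(nM)
        hubs, _ = self.encode_attn(query=hubs, key=x, value=x)

        # Decode: tokens project against hubs for routing fingerprints -> (B, n, M)
        fingerprints = self.decode_proj(x) @ hubs.transpose(1, 2) / d ** 0.5

        # Score: select the top-k tokens by routing score
        scores = self.score_head(fingerprints).squeeze(-1)       # (B, n)
        k = min(self.top_k, n)
        top_idx = scores.topk(k, dim=-1).indices                 # (B, k)

        # Council: every token attends only to the selected subset -> O(nk)
        selected = torch.gather(x, 1, top_idx.unsqueeze(-1).expand(-1, -1, d))
        out, _ = self.council_attn(query=x, key=selected, value=selected)
        return out

# Quick shape check: batch of 2 sequences of length 1024, d_model=256.
router = HubRouterSketch(d_model=256)
print(router(torch.randn(2, 1024, 256)).shape)  # torch.Size([2, 1024, 256])
```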
HubRouter was validated in three settings. Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0) and up to ~90x higher training throughput at sequence length 1024 against matched PyTorch-native baselines (an optimized baseline would narrow this to roughly 10-15x). Graduated replacement of 25% of Transformer attention layers gives the best perplexity in matched-budget sweeps (268.0 vs 282.4 for a pure Transformer). Hub-GPT provides strictly causal routing, achieving PPL 211.5 ± 0.4 over 3 seeds, about 3 PPL worse than Jamba's 208.5 ± 0.7. A multi-seed hub-count sweep (~105 runs across M=1-32) reveals M=8-14 as the reliably converging sub-band (4-5 of 5 seeds).
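The graduated-replacement result suggests a simple interleaving. The schedule below, which swaps every 4th attention layer (25% of layers) for the `HubRouterSketch` block from the sketch above, is an assumption for illustration; the paper's exact placement schedule isn't given here.

```python
import torch.nn as nn

def build_hybrid_stack(num_layers: int = 12, d_model: int = 256, num_heads: int = 4) -> nn.ModuleList:
    """Illustrative 25% graduated replacement: every 4th block routes via hubs.

    Assumes the HubRouterSketch class defined in the earlier sketch;
    which specific layers get replaced is a hypothetical choice here.
    """
    layers = nn.ModuleList()
    for i in range(num_layers):
        if i % 4 == 3:
            layers.append(HubRouterSketch(d_model))  # hub-mediated routing layer
        else:
            layers.append(nn.MultiheadAttention(d_model, num_heads, batch_first=True))  # full attention
    return layers
```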
- Replaces O(n²) attention with O(nM) routing using M learned hub tokens (M=8-14 recommended; see the cost sketch after this list)
- Achieves up to ~90x higher training throughput at sequence length 1024 against matched PyTorch-native baselines
- Hub-Jamba shows a 4.2% PPL improvement; graduated 25% replacement beats pure-Transformer perplexity
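For a concrete sense of the O(nM) vs O(n²) bullet above, here is a back-of-the-envelope score count at n=1024; the hub count M=12 and council size k=64 are illustrative assumptions, not figures from the paper.

```python
# Rough per-layer attention-score counts; constants and head counts ignored.
n, M, k = 1024, 12, 64            # sequence length, hubs, council size (M and k illustrative)

full_attention = n * n            # O(n^2): 1,048,576 pairwise scores
hub_routing = 2 * n * M + n * k   # encode + decode O(nM) plus council O(nk): 90,112

print(f"savings: {full_attention / hub_routing:.1f}x fewer scores")  # ~11.6x
```

Under these assumptions the per-layer score count drops by roughly an order of magnitude, broadly consistent with the ~10-15x figure quoted against an optimized baseline.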
Why It Matters
Makes long-context Transformers practical for production by cutting attention compute from quadratic to near-linear in sequence length while keeping perplexity competitive.