Research & Papers

NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches

New architecture adds permanent nonlinear low-rank branches to transformer linear layers, achieving up to 1.47x faster convergence with minimal overhead.

Deep Dive

Researchers at Canva Research have introduced NOBLE (Nonlinear lOw-rank Branch for Linear Enhancement), an architectural modification for transformer models that promises to accelerate training significantly without a substantial increase in cost. Unlike methods such as LoRA (Low-Rank Adaptation), which are designed for parameter-efficient fine-tuning, NOBLE is built directly into the model architecture during initial pretraining. The approach adds small, permanent nonlinear branches alongside standard transformer linear layers; each branch computes σ(xW_down)W_up, where σ is a learnable nonlinearity called CosNet, a two-layer cosine function with adjustable frequency and phase.
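
The announcement does not include reference code, but the structure is simple enough to sketch. Below is a minimal PyTorch illustration of a linear layer augmented with a permanent low-rank nonlinear branch; the class names (NobleLinear, CosNet here), the default rank, and the exact form of the two-layer cosine nonlinearity are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class CosNet(nn.Module):
        # Assumed form of the learnable cosine nonlinearity: a cosine with
        # trainable per-feature frequency and phase, followed by a linear mix
        # (the "two-layer" structure). Details here are illustrative.
        def __init__(self, dim):
            super().__init__()
            self.freq = nn.Parameter(torch.ones(dim))
            self.phase = nn.Parameter(torch.zeros(dim))
            self.mix = nn.Linear(dim, dim)

        def forward(self, x):
            return self.mix(torch.cos(self.freq * x + self.phase))

    class NobleLinear(nn.Module):
        # A standard linear layer plus a permanent low-rank nonlinear branch:
        # y = x W + sigma(x W_down) W_up, trained jointly from pretraining onward.
        def __init__(self, in_features, out_features, rank=16):
            super().__init__()
            self.linear = nn.Linear(in_features, out_features)
            self.down = nn.Linear(in_features, rank, bias=False)  # W_down
            self.up = nn.Linear(rank, out_features, bias=False)   # W_up
            self.sigma = CosNet(rank)

        def forward(self, x):
            return self.linear(x) + self.up(self.sigma(self.down(x)))

Because the branch is part of the architecture rather than an adapter, it is trained from scratch alongside the base weights and retained at inference, which is where the small permanent parameter and step-time overheads come from.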

In testing across multiple model types, including 250M- and 1.5B-parameter LLMs, BERT, VQGAN, and Vision Transformers (ViT), NOBLE demonstrated consistent efficiency gains. It achieved up to 1.47x faster convergence to the baseline evaluation loss, meaning models reached the same performance level in 32% fewer training steps. The speedup came with minimal overhead: just 4% additional parameters and a 7% increase in step time, resulting in a net wallclock speedup of 1.22x. The research identified one limitation: NOBLE's benefits diminished when combined with certain stochastic data augmentation techniques, such as Mixup and CutMix, in ImageNet classification, suggesting the method is better suited to capturing sharper features of target functions than to smoother fits.
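
As a back-of-envelope restatement of how those headline figures relate (no new results, just the arithmetic):

    # A 1.47x convergence speedup means the baseline eval loss is reached
    # in 1/1.47 of the training steps.
    peak_convergence_speedup = 1.47
    steps_needed = 1 / peak_convergence_speedup
    print(f"steps needed: {steps_needed:.2f}x baseline (~{1 - steps_needed:.0%} fewer steps)")
    # The net wallclock gain is smaller than the step-count gain because each
    # step is ~7% slower; the reported net wallclock speedup is 1.22x.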

The implications are substantial for organizations training large transformer models from scratch, as NOBLE offers a straightforward architectural change that could reduce training costs and time by over 20% with negligible parameter overhead. While the method shows some incompatibility with specific regularization techniques, its consistent performance improvements across diverse transformer architectures suggest it could become a standard component in future model designs, potentially changing how researchers approach transformer architecture optimization during the pretraining phase.

Key Points
  • Achieves up to 1.47x faster convergence (32% fewer steps) with only 4% parameter overhead
  • Uses permanent CosNet nonlinear branches instead of fine-tuning adapters such as LoRA
  • Demonstrated across LLMs (up to 1.5B params), BERT, VQGAN, and ViT with a 1.22x net wallclock speedup

Why It Matters

Could reduce transformer training costs by over 20% for companies building models from scratch, accelerating AI development timelines.