Sparton: Fast and Memory-Efficient Triton Kernel for Learned Sparse Retrieval
New Triton kernel eliminates memory bottleneck in learned sparse retrieval, enabling massive model scaling.
A team of researchers including Thong Nguyen, Cosimo Rulli, and Andrew Yates has introduced Sparton, a novel GPU kernel built with the Triton language to solve a critical bottleneck in Learned Sparse Retrieval (LSR). Models like Splade use a language model head to produce, for every token in the sequence, a logit vector the size of the vocabulary (30k to 250k entries); in standard implementations, the resulting sequence-length-by-vocabulary matrix must be fully stored in memory before it is reduced. This materialization creates massive I/O overhead, throttling training speed and limiting batch sizes. Sparton's innovation is a fused kernel that performs the tiled matrix multiplication, applies the activation functions (ReLU, Log1P), and executes the max-pooling reduction in a single pass, operating directly on raw logit tiles.
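The fusion idea can be illustrated with a small NumPy sketch. This is not the Triton kernel itself, and the function names and tiling scheme are our own illustrative assumptions: the point is that folding each logit tile into a running max ("early online reduction") yields the same pooled vector as the baseline that materializes the full matrix.

```python
import numpy as np

def fused_splade_pool(hidden, w_vocab, tile=64):
    """Illustrative sketch of early online reduction (NOT the Sparton kernel).

    Computes logits tile-by-tile over the sequence dimension, applies
    ReLU then Log1P, and folds each tile into a running max, so the
    full (seq_len, vocab) logit matrix is never materialized.

    hidden:  (seq_len, d) token representations
    w_vocab: (d, vocab) language-model head weights
    returns: (vocab,) max-pooled sparse representation
    """
    seq_len = hidden.shape[0]
    vocab = w_vocab.shape[1]
    # log1p(relu(x)) >= 0, so zeros are a safe identity for the max.
    pooled = np.zeros(vocab)
    for start in range(0, seq_len, tile):
        logits_tile = hidden[start:start + tile] @ w_vocab   # (tile, vocab)
        acts = np.log1p(np.maximum(logits_tile, 0.0))        # ReLU -> Log1P
        pooled = np.maximum(pooled, acts.max(axis=0))        # online max-pool
    return pooled

def naive_splade_pool(hidden, w_vocab):
    """Baseline that materializes the full (seq_len, vocab) logit matrix."""
    logits = hidden @ w_vocab
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)
```

In the sketch only one `(tile, vocab)` slice is live at a time; the real kernel additionally fuses the matmul itself into the same pass on-chip, which is where the I/O saving comes from.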
By performing this 'early online reduction,' Sparton never materializes the full sequence-length-by-vocabulary matrix. Benchmarks show the isolated kernel achieves a 4.8x speedup and an order-of-magnitude (10x) reduction in peak memory usage compared to standard PyTorch implementations. When integrated into real models, the gains are transformative: for a standard Splade model (~30k vocabulary), it enables a 33% larger batch size and 14% faster training. For a large multilingual model with a 250k vocabulary, the improvements jump to a 26x larger batch size and 2.5x faster training throughput, all with no loss in retrieval effectiveness.
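Back-of-envelope arithmetic shows why the 250k-vocabulary case benefits most. The batch size, sequence length, and fp16 precision below are illustrative assumptions, not figures from the paper:

```python
# Peak memory of the materialized logit matrix vs. the pooled output.
# Dimensions are assumed for illustration only.
batch, seq_len, vocab = 32, 256, 250_000
bytes_per_elem = 2  # fp16

# Full (batch, seq_len, vocab) logit tensor the baseline must hold:
full_gb = batch * seq_len * vocab * bytes_per_elem / 1e9    # ~4.1 GB

# The fused kernel only writes the (batch, vocab) pooled result:
pooled_gb = batch * vocab * bytes_per_elem / 1e9            # ~0.016 GB

print(f"materialized: {full_gb:.2f} GB, pooled: {pooled_gb:.3f} GB")
```

Several gigabytes per forward pass for raw logits alone (before activations elsewhere in the model and gradients) is what caps batch sizes; shrinking that term by a factor of `seq_len` is what unlocks the reported 26x larger batches.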
This technical breakthrough directly addresses the scaling limitations of state-of-the-art retrieval systems. By drastically reducing memory pressure, Sparton allows researchers and engineers to train much larger and more effective sparse retrieval models on existing hardware. It paves the way for more complex, vocabulary-rich models that can understand nuanced queries across multiple languages, making powerful semantic search more efficient and accessible.
- Fuses four operations (matmul, ReLU, Log1P, max-pool) into one Triton GPU kernel, avoiding costly I/O.
- Achieves 4.8x speedup and 10x memory reduction in isolation versus PyTorch baselines.
- Enables 26x larger batch sizes and 2.5x faster training for large models (250k vocabulary).
Why It Matters
Enables training of larger, more accurate AI search models faster and on existing hardware, advancing semantic retrieval.