Research & Papers

Flash-MaxSim GPU kernel cuts memory 16x, speeds late-interaction retrieval 3.9x

New fused kernel avoids materializing a 21 GB tensor and handles 10K documents on one GPU.

Deep Dive

Late-interaction retrieval models like ColBERT and ColPali depend on the MaxSim operator: for each query token, take the maximum similarity over all document tokens, then sum those maxima across the query tokens to get one score per document. The standard implementation materializes a full query-token by document-token similarity tensor in GPU memory. For visual ColPali with 10,000 documents, that tensor alone consumes 21 GB in FP16, and it is generated only to be reduced to a single score per document and then discarded. This exhausts a 40 GB GPU and severely limits batch sizes during inference and training.
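To make the operator and the memory figure concrete, here is a minimal NumPy sketch of the naive approach. It is an illustration, not the benchmarked PyTorch baseline. The function names are hypothetical, and the token counts used in the note below (1,024 query tokens and 1,024 document tokens) are an assumption chosen because they reproduce the 21 GB figure; the source does not state them.

```python
import numpy as np

def maxsim_naive(q, d):
    """Naive MaxSim (illustrative sketch).

    q: (Lq, dim) query token embeddings
    d: (N, Ld, dim) token embeddings for N documents
    Returns an (N,) array with one score per document.
    """
    # Full similarity tensor of shape (N, Lq, Ld).
    # This is the allocation that dominates memory.
    sim = np.einsum("qe,nde->nqd", q, d)
    # Max over document tokens, then sum over query tokens.
    return sim.max(axis=2).sum(axis=1)

def sim_tensor_bytes(n_docs, q_tokens, d_tokens, bytes_per_elem=2):
    """Size of the materialized similarity tensor (FP16 = 2 bytes per element)."""
    return n_docs * q_tokens * d_tokens * bytes_per_elem
```

Under the assumed shapes, `sim_tensor_bytes(10_000, 1024, 1024)` gives 20,971,520,000 bytes, or about 21 GB, all of which is reduced to 10,000 scalars.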

Flash-MaxSim solves this with an IO-aware fused kernel that tiles query and document embeddings and streams them through on-chip SRAM, folding the row-max reduction into the same pass so the full tensor is never materialized. The design extends to training via an inverse-grid CSR construction that reuses the forward argmax for atomic-free gradient reduction, and supports INT8 quantization with variable-length (padding-free) scoring. In benchmarks, Flash-MaxSim achieves up to 3.9x speedup on an A100 (4.7x on H100) over naive PyTorch at matched precision, while reducing inference memory by up to 16x and training memory by up to 28x. Critically, it preserves exact ranking (100% top-20 agreement with an FP32 reference), meaning no retrieval quality is sacrificed for the gains.
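The tiling idea can be sketched on the CPU with NumPy. This is not the Flash-MaxSim kernel: there is no SRAM management, fusion, quantization, or backward pass here, and the name `maxsim_tiled` and the `tile` parameter are hypothetical. It only shows why a running row-max lets the full similarity tensor be skipped: each tile of document tokens is scored, folded into the running maximum, and discarded.

```python
import numpy as np

def maxsim_tiled(q, d, tile=128):
    """MaxSim with a streaming max over document-token tiles (illustrative sketch).

    q: (Lq, dim) query token embeddings
    d: (N, Ld, dim) token embeddings for N documents
    Returns an (N,) array with one score per document.
    """
    n_docs, d_tokens, _ = d.shape
    # Running max per (document, query token) pair, initialized to -inf.
    running_max = np.full((n_docs, q.shape[0]), -np.inf)
    for start in range(0, d_tokens, tile):
        block = d[:, start:start + tile, :]          # (N, t, dim), t <= tile
        sim = np.einsum("qe,nte->nqt", q, block)     # only (N, Lq, t) is live
        # Fold this tile's row-max into the running max, then drop the tile.
        running_max = np.maximum(running_max, sim.max(axis=2))
    # Sum the per-query-token maxima to get one score per document.
    return running_max.sum(axis=1)
```

Because the maximum is associative, the result matches the naive computation up to floating-point rounding, while peak temporary memory scales with `tile` rather than with the full document length. A real GPU kernel would also tile over documents and query tokens.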

Key Points
  • Up to 3.9x faster on A100 and 4.7x faster on H100 than naive PyTorch at matched precision.
  • Reduces memory usage by up to 16x during inference and up to 28x during training, enabling larger batch sizes.
  • 100% top-20 ranking agreement with an FP32 reference: rankings are preserved exactly, with no loss in retrieval quality.
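For readers who want to run a similar check on their own scores, one plausible way to measure top-k agreement is the overlap between the two top-k sets. The source does not say how its top-20 agreement was computed, so the definition below (set overlap, ignoring order within the top k) and the name `topk_agreement` are assumptions.

```python
import numpy as np

def topk_agreement(scores_a, scores_b, k=20):
    """Fraction of documents shared by the top-k of two score vectors.

    Assumed definition: set overlap, ignoring order inside the top k.
    """
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    k = min(k, len(scores_a))
    # Negate so argsort returns the highest scores first; stable sort for ties.
    top_a = np.argsort(-scores_a, kind="stable")[:k]
    top_b = np.argsort(-scores_b, kind="stable")[:k]
    return len(set(top_a.tolist()) & set(top_b.tolist())) / k
```

A value of 1.0 means both scorers retrieve the same k documents. A stricter variant would also require identical ordering.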

Why It Matters

Flash-MaxSim makes late-interaction retrieval over 10K-document workloads practical on a single GPU, cutting hardware cost and leaving memory headroom for larger batches and corpora.