Research & Papers

SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data

Handles 800M texts across 40K partitions with 2.6 GB peak memory – a production breakthrough.

Deep Dive

A team of researchers (Kapadia, Mishra, Alugubelli, Kumar, Yadav, Bhatia) has introduced SURGE (SuperBatch Unified Resource-efficient GPU Encoding), a streaming system designed to solve a critical bottleneck in production embedding pipelines: the tension between logical data partitioning and efficient GPU utilization. When each partition is processed independently, the pipeline incurs one inter-process communication call per partition (P calls in total), and for compute-light models like bge-base (109M parameters, d=768) this fixed per-call overhead dominates the cheap forward pass and caps throughput. The naive fix, batching everything at a fixed size, requires O(N) peak memory (32.7 GB for 10M texts, infeasible beyond ~60M texts on 192 GB nodes), produces no output until the entire job completes, and offers no fault tolerance. SURGE replaces this with a streaming two-threshold policy that bounds peak memory at O(B_min + n_max), just 2.6 GB for 10M texts.
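
A minimal sketch of what such an accumulation loop could look like, inferred from the stated O(B_min + n_max) bound rather than taken from the paper; the names superbatches and b_min are illustrative:

    from typing import Iterable, Iterator, List

    def superbatches(partitions: Iterable[List[str]],
                     b_min: int) -> Iterator[List[List[str]]]:
        # Accumulate whole partitions until the lower threshold b_min is
        # reached, then flush. Because partitions are never split, the
        # buffer overshoots b_min by at most one partition, so peak
        # buffered texts stay within b_min + n_max, where n_max is the
        # largest partition (cf. the memory-safety bound of Lemma 3).
        buffer: List[List[str]] = []
        buffered = 0
        for part in partitions:
            buffer.append(part)
            buffered += len(part)
            if buffered >= b_min:
                yield buffer          # encode on GPU, write out, recycle
                buffer, buffered = [], 0
        if buffer:                    # flush the final partial SuperBatch
            yield buffer

Crash recovery at SuperBatch granularity then falls out naturally: each flushed group is an independent unit of work that can be retried on failure.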

SURGE's key theoretical contributions are a cost model (Theorem 1) that predicts throughput to within 2% across three encoders spanning a 15× parameter range, a memory-safety bound (Lemma 3) that underpins the streaming policy, and a φ/CV decision framework characterizing when the pattern applies beyond the authors' workload. In practice, on 10M texts with 4 NVIDIA L4 GPUs, SURGE delivers 26,413 texts/s, matching fixed-batch throughput while using 12.6× less memory. It also achieves 68× faster time-to-first-output and crash recovery at SuperBatch granularity. Against a partition-batched baseline (PB-PBP-LB), SURGE retains a 7% throughput edge and 2.5× faster TTFO. Complementary engineering, zero-copy Arrow serialization (a 22–25× serialization speedup) and async I/O pipelining (up to 93% benefit), realizes the design. Validated on bge-base and across log-normal partition-size distributions, SURGE's speedup holds to within ±3%.
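
For the zero-copy Arrow step, a sketch of what wrapping an embedding matrix without per-row serialization can look like with pyarrow; the schema below (an id column plus a fixed-size-list embedding column) is an assumption, not the paper's format:

    import numpy as np
    import pyarrow as pa

    def embeddings_to_arrow(ids: list[str], emb: np.ndarray) -> pa.RecordBatch:
        # pa.array over a contiguous numeric NumPy buffer is zero-copy:
        # Arrow wraps the existing memory instead of pickling row by row.
        # (ascontiguousarray is a no-op when emb is already contiguous
        # float32; otherwise it copies once.)
        assert emb.shape[0] == len(ids)
        d = emb.shape[1]
        flat = pa.array(np.ascontiguousarray(emb, dtype=np.float32).reshape(-1))
        vectors = pa.FixedSizeListArray.from_arrays(flat, d)
        return pa.RecordBatch.from_arrays([pa.array(ids), vectors],
                                          names=["id", "embedding"])

    # Batches can then be streamed to a sink in Arrow IPC format:
    #   with pa.ipc.new_stream(sink, batch.schema) as writer:
    #       writer.write_batch(batch)

Avoiding per-row serialization is exactly where a 22–25× speedup of this kind comes from; the async I/O gain (up to 93%) is orthogonal, from overlapping encoding with reads and writes.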

Key Points
  • SURGE processes 800M texts across 40,000 partitions; peak memory is bounded at O(B_min + n_max), 2.6 GB, versus the O(N) cost of fixed batching, already 32.7 GB at 10M texts.
  • Matches fixed-batch throughput at 26,413 texts/s on 4 L4 GPUs while using 12.6× less memory, and offers 68× faster time-to-first-output.
  • Includes a cost model predicting throughput within 2% across a 15× encoder parameter range, and a φ/CV decision framework for generalization (see the sketch after this list).
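
The digest does not define φ or CV, but CV plausibly denotes the coefficient of variation of partition sizes; under that assumption, the dispersion side of the decision check is a one-liner (the threshold below is hypothetical, not the paper's φ):

    import numpy as np

    def partition_cv(partition_sizes: list[int]) -> float:
        # Coefficient of variation: std / mean of partition sizes. High CV
        # means heterogeneous partitions, the regime where per-partition
        # batching starves the GPU and SuperBatching should pay off.
        sizes = np.asarray(partition_sizes, dtype=np.float64)
        return float(sizes.std() / sizes.mean())

    # Hypothetical decision rule (threshold value invented for illustration):
    #   if partition_cv(sizes) > 0.5:
    #       use_superbatching()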

Why It Matters

Enables large-scale embedding generation on memory-constrained hardware, drastically reducing infrastructure cost and latency for production RAG pipelines.