Research & Papers

GetBatch: Distributed Multi-Object Retrieval for ML Data Loading

New storage API replaces thousands of GET requests with a single, fault-tolerant streaming operation.

Deep Dive

A team of researchers including Alex Aizman, Abhishek Gaikwad, and Piotr Żelasko has published a preprint detailing GetBatch, a novel API designed to solve a critical bottleneck in large-scale machine learning. Modern training pipelines consume data in batches, often requiring thousands of samples drawn from shards distributed across a storage cluster. Issuing a corresponding flood of individual GET requests creates massive per-request overhead that can dominate total data transfer time, slowing down model iteration. GetBatch addresses this by elevating batch retrieval to a first-class storage operation, fundamentally changing how training jobs interact with data stores.
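
To make the per-request overhead concrete, here is a minimal sketch of the conventional pattern the paper critiques; the base_url layout and the load_batch_naive helper are illustrative assumptions, not the paper's code:

  import urllib.request

  # Conventional loader: one blocking GET per sample, so connection
  # setup, header parsing, and server-side object lookup are paid once
  # per object, i.e. thousands of times for a single training batch.
  def load_batch_naive(base_url: str, object_keys: list[str]) -> list[bytes]:
      samples = []
      for key in object_keys:  # e.g. several thousand keys per batch
          with urllib.request.urlopen(f"{base_url}/{key}") as resp:
              samples.append(resp.read())  # one full round trip per object
      return samples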

The core technical innovation replaces those independent GET operations with a single, deterministic, and fault-tolerant streaming execution. In benchmark tests, GetBatch demonstrated up to a 15x throughput improvement for small objects. More importantly, in a simulated production training workload, it halved P95 batch retrieval latency and cut P99 per-object tail latency by 3.7x compared to the traditional approach. For ML engineers and infrastructure teams, that translates into faster training cycles for large language models at the scale of GPT-4, Llama 3, or Claude 3.5, where data loading is often the hidden limiter. The work suggests a necessary evolution in storage system design to keep pace with the demands of next-generation AI.
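
The preprint's wire protocol is not reproduced here, so the following is only a sketch of what a GetBatch-style client call could look like; the endpoint path, the JSON key manifest, and the 8-byte length-prefix framing are all assumptions for illustration:

  import json
  import urllib.request
  from typing import Iterator

  def _read_exact(stream, n: int) -> bytes:
      # read() may return fewer than n bytes; loop until we have them all.
      buf = b""
      while len(buf) < n:
          chunk = stream.read(n - len(buf))
          if not chunk:
              raise EOFError("stream ended mid-object")
          buf += chunk
      return buf

  # Hypothetical GetBatch-style client: the full key list ships once,
  # and the server streams the objects back in manifest order.
  def get_batch(endpoint: str, object_keys: list[str]) -> Iterator[bytes]:
      manifest = json.dumps({"keys": object_keys}).encode()
      req = urllib.request.Request(
          endpoint, data=manifest,
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          for _ in object_keys:
              size = int.from_bytes(_read_exact(resp, 8), "big")
              yield _read_exact(resp, size)  # next object's payload

Because ordering is fixed by the manifest, a consumer can overlap decoding with the still-arriving stream, and the server can schedule reads across shards for the whole batch at once; that is the kind of restructuring that plausibly drives the small-object throughput and tail-latency gains reported above.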

Key Points
  • Replaces thousands of individual GET requests with one streaming operation for batch data retrieval.
  • Achieves up to 15x throughput gain for small objects and cuts P99 tail latency by 3.7x.
  • Directly targets and alleviates a major I/O bottleneck in production-scale ML training pipelines (see the training-loop sketch below).
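
For context, here is a hypothetical way the get_batch sketch above could slot into a training loop; batch_key_lists and the endpoint URL are placeholders, not the paper's API:

  # Each step issues one streaming call for the whole batch instead of
  # thousands of per-object GETs.
  for step, keys in enumerate(batch_key_lists):
      samples = list(get_batch("https://store.example/v1/getbatch", keys))
      # ...decode, collate, and feed `samples` to the training step...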

Why It Matters

This directly accelerates AI model training cycles, reducing costs and iteration time for companies running large-scale ML workloads.