Research & Papers

GetBatch: Distributed Multi-Object Retrieval for ML Data Loading

New storage API replaces thousands of GET requests with a single, fault-tolerant streaming operation.

Deep Dive

A team of researchers including Alex Aizman, Abhishek Gaikwad, and Piotr Żelasko has published a preprint detailing GetBatch, a novel API designed to solve a critical bottleneck in large-scale machine learning. Modern training pipelines consume data in batches, often requiring thousands of samples drawn from shards distributed across a storage cluster. Issuing a corresponding flood of individual GET requests creates massive per-request overhead that can dominate total data transfer time, slowing down model iteration. GetBatch addresses this by elevating batch retrieval to a first-class storage operation, fundamentally changing how training jobs interact with data stores.
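
To make the per-request overhead concrete, here is a minimal sketch of the conventional pattern the paper critiques; the base_url layout and the load_batch_naive helper are illustrative assumptions, not the paper's code:

  import urllib.request

  # Conventional loader: one blocking GET per sample, so connection
  # setup, header parsing, and server-side object lookup are paid once
  # per object, i.e. thousands of times for a single training batch.
  def load_batch_naive(base_url: str, object_keys: list[str]) -> list[bytes]:
      samples = []
      for key in object_keys:  # e.g. several thousand keys per batch
          with urllib.request.urlopen(f"{base_url}/{key}") as resp:
              samples.append(resp.read())  # one full round trip per object
      return samples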

The core technical innovation replaces those independent GET operations with a single, deterministic, and fault-tolerant streaming execution. In benchmark tests, GetBatch demonstrated up to a 15x throughput improvement for small objects. More importantly, in a simulated production training workload, it halved P95 batch retrieval latency and cut P99 per-object tail latency by 3.7x compared to the traditional approach. For ML engineers and infrastructure teams, that translates into faster training cycles for large language models at the scale of GPT-4, Llama 3, or Claude 3.5, where data loading is often the hidden limiter. The work suggests a necessary evolution in storage system design to keep pace with the demands of next-generation AI.
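
The preprint's wire protocol is not reproduced here, so the following is only a sketch of what a GetBatch-style client call could look like; the endpoint path, the JSON key manifest, and the 8-byte length-prefix framing are all assumptions for illustration:

  import json
  import urllib.request
  from typing import Iterator

  def _read_exact(stream, n: int) -> bytes:
      # read() may return fewer than n bytes; loop until we have them all.
      buf = b""
      while len(buf) < n:
          chunk = stream.read(n - len(buf))
          if not chunk:
              raise EOFError("stream ended mid-object")
          buf += chunk
      return buf

  # Hypothetical GetBatch-style client: the full key list ships once,
  # and the server streams the objects back in manifest order.
  def get_batch(endpoint: str, object_keys: list[str]) -> Iterator[bytes]:
      manifest = json.dumps({"keys": object_keys}).encode()
      req = urllib.request.Request(
          endpoint, data=manifest,
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          for _ in object_keys:
              size = int.from_bytes(_read_exact(resp, 8), "big")
              yield _read_exact(resp, size)  # next object's payload

Because ordering is fixed by the manifest, a consumer can overlap decoding with the still-arriving stream, and the server can schedule reads across shards for the whole batch at once; that is the kind of restructuring that plausibly drives the small-object throughput and tail-latency gains reported above.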

Key Points
  • Replaces thousands of individual GET requests with one streaming operation for batch data retrieval.
  • Achieves up to 15x throughput gain for small objects and cuts P99 tail latency by 3.7x.
  • Directly targets and alleviates a major I/O bottleneck in production-scale ML training pipelines (see the training-loop sketch below).
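
For context, here is a hypothetical way the get_batch sketch above could slot into a training loop; batch_key_lists and the endpoint URL are placeholders, not the paper's API:

  # Each step issues one streaming call for the whole batch instead of
  # thousands of per-object GETs.
  for step, keys in enumerate(batch_key_lists):
      samples = list(get_batch("https://store.example/v1/getbatch", keys))
      # ...decode, collate, and feed `samples` to the training step...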

Why It Matters

This directly accelerates AI model training cycles, reducing costs and iteration time for companies running large-scale ML workloads.