Research & Papers

GradsSharding breaks serverless memory ceiling for large model FL

A new technique shards gradients so that aggregation for a model of any size fits within the 10 GB per-function memory limit.

Deep Dive

Federated learning aggregation on serverless platforms has hit a hard scalability ceiling: existing architectures like lambda-FL and LIFL partition clients across aggregators, but each aggregator must hold the complete model gradient in memory. When gradients exceed the per-function memory limit (10 GB on AWS Lambda), aggregation becomes infeasible regardless of tree depth or branching factor.

Amine Barrak's GradsSharding instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives the corresponding shard from every client. Because FedAvg averaging is element-wise, this produces results bit-identical to tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|θ|/M), independent of client count, so M can be raised to aggregate arbitrarily large models. Evaluations against lambda-FL and LIFL across model sizes from 43 MB to 5 GB show a cost crossover at approximately 500 MB gradient size and a 2.7x cost reduction at VGG-16 scale. GradsSharding is also the only architecture that remains deployable beyond the serverless memory ceiling.
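
To see why sharding cannot change the result, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`split_into_shards`, `average_shard`, `sharded_fedavg`, `full_fedavg`) are hypothetical, gradients are flat lists of floats, and every client is assumed to carry equal weight. Each call to `average_shard` stands in for one serverless function.

```python
from typing import List


def split_into_shards(vector: List[float], num_shards: int) -> List[List[float]]:
    """Split a flat gradient into num_shards contiguous, near-equal slices."""
    base, extra = divmod(len(vector), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more element
        shards.append(vector[start:start + size])
        start += size
    return shards


def average_shard(client_shards: List[List[float]]) -> List[float]:
    """Element-wise mean of one shard across all clients (one function's work)."""
    n = len(client_shards)
    return [sum(values) / n for values in zip(*client_shards)]


def sharded_fedavg(client_grads: List[List[float]], num_shards: int) -> List[float]:
    """Average each shard independently, then concatenate the shard results."""
    per_client = [split_into_shards(g, num_shards) for g in client_grads]
    averaged = [
        average_shard([shards[m] for shards in per_client])
        for m in range(num_shards)
    ]
    return [x for shard in averaged for x in shard]


def full_fedavg(client_grads: List[List[float]]) -> List[float]:
    """Reference: average the whole gradient in a single place."""
    n = len(client_grads)
    return [sum(values) / n for values in zip(*client_grads)]


# Assumption: equal client weights; toy values chosen for illustration only.
grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0, 7.0],
    [5.0, 6.0, 7.0, 8.0, 9.0],
]
assert sharded_fedavg(grads, num_shards=2) == full_fedavg(grads)
```

Each output element depends only on the same position in every client's gradient, so where the shard boundaries fall has no effect on the values. In this sketch the per-element summation order is the same in both paths, which is why the comparison can use exact equality.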

Key Points
  • GradsSharding partitions gradients into M shards, each averaged independently by a serverless function
  • Per-function memory is O(|θ|/M), independent of client count, enabling arbitrarily large models
  • Achieves a 2.7x cost reduction at VGG-16 scale and is the only architecture deployable beyond the 10 GB memory limit
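
The O(|θ|/M) bound can be turned into a rough shard-count estimate. The sketch below is back-of-the-envelope arithmetic, not a formula from the paper: the helper `min_shards` is hypothetical, and it assumes each function holds two shard-sized buffers at once (a running sum plus one incoming client contribution) while ignoring runtime overhead. The 40 GB gradient is an illustrative figure, not a measured one.

```python
GB = 1024 ** 3  # bytes per gibibyte, used for both the gradient and the limit


def min_shards(gradient_bytes: int, limit_bytes: int, buffers_per_function: int = 2) -> int:
    """Smallest M such that buffers_per_function * (gradient_bytes / M) <= limit_bytes.

    Assumption: memory use is dominated by `buffers_per_function` shard-sized
    buffers; real deployments would need extra headroom.
    """
    usable = limit_bytes // buffers_per_function  # bytes available per buffer
    shards = -(-gradient_bytes // usable)         # integer ceiling division
    return max(1, shards)


# A 5 GB gradient fits in one function under these assumptions.
assert min_shards(5 * GB, 10 * GB) == 1
# A hypothetical 40 GB gradient needs 8 shards of 5 GB each.
assert min_shards(40 * GB, 10 * GB) == 8
```

Because the estimate does not involve the number of clients, adding clients changes how many contributions each function processes but not how much memory it needs.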

Why It Matters

GradsSharding enables federated learning on models too large for a single serverless function while keeping the cost advantages of serverless infrastructure, removing a key scalability bottleneck.
