Research & Papers

Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning

New technique shards gradients so models of any size fit within the 10 GB serverless memory limit.

Deep Dive

Federated learning aggregation on serverless platforms has hit a hard scalability ceiling: existing architectures like lambda-FL and LIFL partition clients across aggregators, but each aggregator must hold the complete model gradient in memory. When gradients exceed the per-function memory limit—10 GB on AWS Lambda—aggregation becomes infeasible regardless of tree depth or branching factor.
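As a back-of-envelope illustration of that ceiling (the arithmetic is ours, not from the paper): a dense fp32 gradient has 4 bytes per parameter, so a model around 2.7 billion parameters already exceeds 10 GB and cannot be aggregated by any single function, no matter how the tree is shaped.

```python
LAMBDA_LIMIT_GB = 10  # AWS Lambda per-function memory ceiling

def grad_size_gb(num_params, bytes_per_elem=4):
    """Size of a dense fp32 gradient in GB (1 GB = 1024**3 bytes)."""
    return num_params * bytes_per_elem / 1024**3

# A ~2.7B-parameter model already exceeds the limit in fp32:
print(grad_size_gb(2_700_000_000))  # ≈ 10.06 GB > LAMBDA_LIMIT_GB
```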

Amine Barrak's GradsSharding instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives contributions from all clients. Because FedAvg averaging is element-wise, this produces bit-identical results to tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|θ|/M), independent of client count, enabling aggregation of arbitrarily large models. Evaluations against lambda-FL and LIFL across model sizes from 43 MB to 5 GB show a cost crossover at approximately 500 MB gradient size, 2.7x cost reduction at VGG-16 scale, and that GradsSharding is the only architecture that remains deployable beyond the serverless memory ceiling.
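The core idea can be sketched in a few lines. This is a minimal single-process sketch, not the paper's implementation: it assumes flattened dense gradients, and the function and parameter names are ours. Each shard aggregation stands in for one serverless function that touches only its slice of every client's gradient.

```python
import numpy as np

def shard_bounds(total_len, num_shards):
    """Split a flattened gradient of total_len elements into num_shards
    contiguous index ranges (earlier shards absorb the remainder)."""
    base, rem = divmod(total_len, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < rem else 0)
        bounds.append((start, end))
        start = end
    return bounds

def aggregate_shard(client_grads, start, end):
    """One 'serverless function': average only its slice of every
    client's gradient, so its memory is O(|theta| / M)."""
    return np.mean([g[start:end] for g in client_grads], axis=0)

def sharded_fedavg(client_grads, num_shards):
    """Run M independent shard aggregations, then concatenate.
    Because FedAvg averaging is element-wise, this matches averaging
    the full gradients in one place."""
    n = client_grads[0].size
    parts = [aggregate_shard(client_grads, s, e)
             for s, e in shard_bounds(n, num_shards)]
    return np.concatenate(parts)
```

In the actual system each `aggregate_shard` call would run as its own function invocation, which is why the per-function footprint stays bounded regardless of how many clients contribute.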

Key Points
  • GradsSharding partitions gradients into M shards, each averaged independently by a serverless function
  • Per-function memory is O(|θ|/M), independent of client count, enabling arbitrarily large models
  • Achieves 2.7x cost reduction at VGG-16 scale and is the only architecture deployable beyond the 10 GB memory limit

Why It Matters

Enables federated learning on massive models using cost-effective serverless infrastructure, removing a key scalability bottleneck.