Research & Papers

ZorBA: Zeroth-order Federated Fine-tuning of LLMs with Heterogeneous Block Activation

New federated learning method reduces VRAM usage by up to 62% while maintaining model performance across distributed clients.

Deep Dive

A research team led by Chuiyang Meng has introduced ZorBA, a federated learning framework designed to overcome the memory and communication bottlenecks of fine-tuning large language models (LLMs) across distributed clients. Traditional federated fine-tuning requires each client to store the entire model and its gradients, leading to prohibitive GPU memory (VRAM) usage and heavy communication overhead from frequent model exchanges. ZorBA tackles this with a zeroth-order optimization approach that eliminates gradient storage on client devices entirely, estimating updates from forward passes alone. This core innovation is paired with a heterogeneous block activation mechanism, in which the central server strategically allocates different subsets of the model's transformer blocks to different clients for updating.
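To make the gradient-free update concrete, the following is a minimal sketch of a SPSA-style zeroth-order step of the kind such methods build on, not the paper's exact procedure; the loss_fn callable and the hyperparameter values are illustrative assumptions. The gradient along one random direction is estimated from two forward passes, and the direction itself is regenerated from a seed rather than stored:

```python
import torch

def zo_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One SPSA-style zeroth-order update from two forward passes.

    No backward pass is run: the gradient along a random direction z
    is estimated from a finite difference of the loss, so the client
    never stores gradients or backprop activations. loss_fn is an
    assumed helper returning a scalar loss tensor.
    """
    def perturb(scale):
        # Regenerate the same direction z from the seed instead of
        # storing it, keeping peak memory at inference level.
        gen = torch.Generator().manual_seed(seed)
        for p in model.parameters():
            if p.requires_grad:
                z = torch.randn(p.shape, generator=gen).to(p.device)
                p.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0)                      # theta + eps*z
        loss_plus = loss_fn(model, batch)
        perturb(-2.0)                      # theta - eps*z
        loss_minus = loss_fn(model, batch)
        perturb(+1.0)                      # restore theta

        # Scalar projected gradient along z.
        proj_grad = (loss_plus - loss_minus).item() / (2 * eps)

        # theta <- theta - lr * proj_grad * z, with z regenerated.
        gen = torch.Generator().manual_seed(seed)
        for p in model.parameters():
            if p.requires_grad:
                z = torch.randn(p.shape, generator=gen).to(p.device)
                p.add_(-lr * proj_grad * z)

    return proj_grad
```

Because z is recreated from the seed on demand, peak memory stays at inference level: no optimizer states, gradient tensors, or backprop activations are kept.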

The technical architecture of ZorBA is built around deciding which blocks are activated on which clients so as to jointly maximize convergence speed and minimize VRAM consumption. The framework uses shared random seeds and scalar finite-difference gradient estimates to drastically reduce the communication payload between clients and the central server. The researchers formulated the block allocation as an optimization problem and developed an ε-constraint lexicographic algorithm to solve it. Experimental results show that ZorBA outperforms three existing federated fine-tuning baselines, achieving up to a 62.41% reduction in VRAM usage while incurring low communication overhead. This paves the way for more efficient collaborative AI, enabling institutions with limited hardware, such as hospitals or research labs, to jointly refine powerful LLMs on sensitive, decentralized data without the traditional memory constraints.
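The shared-seed trick is what collapses the payload: if the server holds the same seed as the client, the client only needs to upload the seed and the scalar projected gradient, and the server can reconstruct the full parameter update locally. Here is a minimal sketch under that assumption; the message layout and function names are hypothetical, not the paper's protocol:

```python
import torch

def client_payload(seed: int, proj_grad: float) -> dict:
    # The entire per-step upload: one seed and one scalar, instead
    # of full gradient tensors for the activated blocks.
    return {"seed": seed, "proj_grad": proj_grad}

@torch.no_grad()
def server_apply(params, payload: dict, lr: float = 1e-6):
    """Reconstruct and apply a client's update from (seed, scalar).

    params must be iterated in the same order the client used, so
    the regenerated directions line up parameter by parameter.
    """
    gen = torch.Generator().manual_seed(payload["seed"])
    for p in params:
        # The same seed yields the same direction z the client used,
        # so no tensors ever cross the network.
        z = torch.randn(p.shape, generator=gen).to(p.device)
        p.add_(-lr * payload["proj_grad"] * z)
```

A payload of a 64-bit seed plus one float is a few dozen bytes per step, versus gigabytes for the dense gradients of a billion-parameter model.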

Key Points
  • Uses zeroth-order optimization to eliminate gradient storage, shrinking the client-side memory footprint
  • Heterogeneous block activation assigns different transformer blocks to different clients, cutting VRAM usage by up to 62.41% (see the allocation sketch after this list)
  • Employs shared random seeds and scalar finite-difference gradient estimates to minimize communication overhead
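For intuition on how block activation can respect uneven hardware, the sketch below uses a deliberately simple greedy allocator under per-client VRAM budgets. It is a stand-in for illustration only: the paper's ε-constraint lexicographic algorithm jointly optimizes convergence speed and memory, which this heuristic does not attempt, and all field names and numbers are assumptions.

```python
def assign_blocks(num_blocks, clients):
    """Greedy block assignment under per-client VRAM budgets.

    Illustrative stand-in, not the paper's solver: each client
    activates as many blocks as its budget allows, and the starting
    block rotates so the federation covers all blocks together.
    """
    assignments, start = {}, 0
    for c in clients:
        k = max(1, int(c["vram_budget_gb"] // c["gb_per_block"]))
        k = min(k, num_blocks)
        # Wrap-around slice so consecutive clients take different blocks.
        assignments[c["id"]] = sorted((start + i) % num_blocks for i in range(k))
        start = (start + k) % num_blocks
    return assignments

# Example: a 32-block model shared by three heterogeneous clients.
print(assign_blocks(32, [
    {"id": "hospital_a", "vram_budget_gb": 8,  "gb_per_block": 1.5},
    {"id": "lab_b",      "vram_budget_gb": 24, "gb_per_block": 1.5},
    {"id": "clinic_c",   "vram_budget_gb": 6,  "gb_per_block": 1.5},
]))
```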

Why It Matters

Enables organizations with limited GPU memory to collaboratively fine-tune state-of-the-art LLMs on private, distributed datasets.