Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding
New 'cyclic gradient coding' technique protects model training from malicious devices and heterogeneous data.
A team of researchers from institutions including EPFL and KTH Royal Institute of Technology has published a new paper, 'Byzantine-Robust and Communication-Efficient Distributed Training: Compressive and Cyclic Gradient Coding,' introducing a novel method to solve a critical flaw in distributed AI training. Current methods for training models across many devices (like smartphones or servers) struggle when some devices are malicious ("Byzantine" attackers sending incorrect data) or when the data on each device is highly varied ("data heterogeneity"). Existing robust aggregation rules break down under these conditions, leaving a residual solution error that does not shrink with more training.
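To see why heterogeneity alone can defeat a robust aggregator, consider a minimal sketch (not from the paper, purely illustrative): when honest workers hold non-IID data, their local gradients differ, and a robust rule such as the coordinate-wise median no longer recovers the true mean gradient. The resulting bias persists no matter how long training runs.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0, 0.5])

# Skewed, zero-mean per-worker offsets model heterogeneous (non-IID) data
# across 9 honest workers -- no attacker is even needed.
offsets = rng.exponential(scale=2.0, size=(9, 3))
offsets -= offsets.mean(axis=0)
local_grads = true_grad + offsets

mean_agg = local_grads.mean(axis=0)          # equals true_grad by construction
median_agg = np.median(local_grads, axis=0)  # a common robust aggregation rule

# Because the offset distribution is skewed, the median is biased away
# from the true gradient -- a residual error that training cannot remove.
bias = float(np.linalg.norm(median_agg - true_grad))
print(f"median aggregation bias: {bias:.3f}")
```

This is the gap the paper's coding-based redundancy is designed to close.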
To overcome this, the researchers developed LAD, a cyclic gradient coding-based distributed training method. Before training begins, the entire dataset is distributed to all worker devices. In each training iteration, the central server uses a 'cyclic gradient coding' scheme to assign redundant computational tasks to each device. Honest devices compute gradients on their assigned data subsets, encode them, and send the coded vectors to the server. The server then aggregates these, filtering out potentially incorrect messages from malicious devices. Because each data subset is computed by multiple workers, the system can mathematically guarantee convergence and significantly lower the final error, even under attack. The team also created Com-LAD, a compressed variant that reduces the communication bandwidth needed, making the method practical for real-world constrained settings like federated learning on mobile networks.
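The redundant assignment can be sketched as follows. This is a simplified illustration, not the paper's exact encoding: each of n workers is assigned r consecutive data partitions (mod n), so every partition is replicated on r distinct workers, and the server filters the r copies of each partial gradient with a simple median.

```python
import numpy as np

def cyclic_assignment(n_workers: int, redundancy: int):
    """Worker i gets partitions i, i+1, ..., i+r-1 (mod n), so every
    partition is held by exactly r distinct workers."""
    return [[(i + j) % n_workers for j in range(redundancy)]
            for i in range(n_workers)]

n, r = 5, 3
assign = cyclic_assignment(n, r)  # e.g. assign[0] == [0, 1, 2]

# Toy 1-D partial gradients: honest workers report the true value for each
# assigned partition; a Byzantine worker (here, worker 2) reports garbage.
true_partials = np.arange(n, dtype=float)
byzantine = {2}
reports = {(w, p): (1e6 if w in byzantine else true_partials[p])
           for w in range(n) for p in assign[w]}

# The server decodes each partition from its r copies with a median --
# one simple robust filter; the paper's coded aggregation is more refined.
recovered = [float(np.median([reports[(w, p)]
                              for w in range(n) if p in assign[w]]))
             for p in range(n)]
print(recovered)  # matches true_partials despite the attacker
```

With redundancy r = 3, a single attacker is always outvoted on every partition it touches, which is the intuition behind the method's convergence guarantee.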
- Proposes LAD, a method using 'cyclic gradient coding' to redundantly assign tasks, ensuring training convergence even with malicious devices and non-identical data.
- Introduces a compressed variant, Com-LAD, that significantly reduces communication overhead, crucial for bandwidth-limited environments like federated learning.
- Analytically proves improved robustness and lower solution error, validated by numerical results, addressing a key limitation in prior Byzantine-robust aggregation techniques.
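The compression step in Com-LAD targets the size of the coded vectors each worker uploads. The paper's specific compression operator is not detailed here, but a generic sparsifier in the same spirit, such as top-k selection, conveys the bandwidth saving: only the largest-magnitude coordinates (plus their indices) are transmitted.

```python
import numpy as np

def top_k_compress(vec: np.ndarray, k: int):
    """Keep only the k largest-magnitude entries (indices + values).
    A generic illustrative sparsifier -- Com-LAD's actual operator may differ."""
    idx = np.argsort(np.abs(vec))[-k:]
    return idx, vec[idx]

def decompress(idx: np.ndarray, vals: np.ndarray, dim: int) -> np.ndarray:
    out = np.zeros(dim)
    out[idx] = vals
    return out

g = np.array([0.01, -3.0, 0.2, 5.0, -0.05, 1.5])
idx, vals = top_k_compress(g, k=2)       # transmit 2 of 6 coordinates
g_hat = decompress(idx, vals, g.size)
print(g_hat)                             # [ 0. -3.  0.  5.  0.  0.]
```

Sparsification like this trades a controlled approximation error for a large reduction in bytes sent per round, which is exactly the regime bandwidth-limited federated deployments operate in.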
Why It Matters
Enables more secure and reliable large-scale AI training across decentralized, untrusted devices, advancing practical federated learning and collaborative AI.