Corpus ID: 222177196

Optimal Gradient Compression for Distributed and Federated Learning

Alyazeed Albasyoni, Mher Safaryan, Laurent Condat, Peter Richtárik
Communicating information, like gradient vectors, between computing nodes in distributed and federated learning is typically an unavoidable burden, resulting in scalability issues. Indeed, communication might be slow and costly. Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques, in the form of sparsification, quantization, or low-rank approximation. Since compression is a lossy, or inexact, process, the iteration…
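The abstract surveys sparsification, quantization, and low-rank approximation as the main compression families. A minimal Top-k sparsification sketch (a standard compressor used throughout this literature, not the paper's specific construction) illustrates the basic idea:

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of the gradient
    and zero out the rest (a classic sparsification compressor)."""
    compressed = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of k largest |g_i|
    compressed[idx] = grad[idx]
    return compressed

g = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
print(top_k_sparsify(g, 2))  # keeps -3.0 and 2.0, zeros elsewhere
```

Only the k values and their indices need to be communicated, which is the source of the bandwidth savings, at the cost of a lossy update.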

Figures and Tables from this paper

Distributed Learning and Democratic Embeddings: Polynomial-Time Source Coding Schemes Can Achieve Minimax Lower Bounds for Distributed Gradient Descent under Communication Constraints
The proposed source coding scheme can remarkably improve performance when used in conjunction with other compression schemes, and achieves the minimax-optimal convergence rate to within (near) constant factors for a class of communication-constrained distributed optimization algorithms.
Smoothness-Aware Quantization Techniques
This work extends Safaryan et al.'s smoothness-aware compression strategy to arbitrary unbiased compression operators, which also includes sparsification, and provides extensive numerical evidence that smoothness-aware quantization strategies outperform existing quantization schemes as well as the aforementioned smoothness-aware sparsification strategies with respect to all relevant success measures.
Rethinking gradient sparsification as total error minimization
This work identifies that the total error — the sum of the compression errors across all iterations — encapsulates sparsification throughout training, and proposes a communication complexity model that minimizes the total error under a communication budget for the entire training.
Wyner-Ziv Estimators: Efficient Distributed Mean Estimation with Side Information
Without any probabilistic assumptions on the underlying data, this work studies distributed mean estimation where the server has access to side information and proposes Wyner-Ziv estimators, which are communication- and computation-efficient and near-optimal when an upper bound on the distance between the side information and the data is known.
MURANA: A Generic Framework for Stochastic Variance-Reduced Optimization
A generic variance-reduced algorithm for minimizing a sum of several smooth functions plus a regularizer, in a sequential or distributed manner, formulated with general stochastic operators, which allow it to cover many existing randomization mechanisms within a unified framework.
Improved Communication Efficiency for Distributed Mean Estimation with Side Information
  • Kai Liang, Youlong Wu
  • Computer Science, Mathematics
  • 2021 IEEE International Symposium on Information Theory (ISIT)
  • 2021
This paper proposes a practical and efficient estimator based on an r-bit Wyner-Ziv estimator proposed by Mayekar et al., which requires no probabilistic assumptions on the data.
Federated Learning with Spiking Neural Networks
This paper proposes a federated learning framework for decentralized and privacy-preserving training of Spiking Neural Networks, and observes that SNNs outperform ANNs in overall accuracy by over 15% when the data is distributed across a large number of clients in the federation, while providing up to 5.3× energy efficiency.
Layer-wise Adaptive Model Aggregation for Scalable Federated Learning
  • Sunwoo Lee, Tuo Zhang, Chaoyang He, S. Avestimehr
  • Computer Science
  • 2021
This work proposes FedLAMA, a layer-wise model aggregation scheme for scalable Federated Learning that reduces the communication cost by up to 60% for IID data and 70% for non-IID data while achieving a comparable accuracy to FedAvg.
Parallel and Distributed algorithms for ML problems
A survey of modern parallel and distributed approaches to solving sum-type convex minimization problems arising from ML applications.


Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback
A general distributed compressed SGD with Nesterov's momentum is proposed, which achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall-clock time.
Natural Compression for Distributed Deep Learning
This work introduces a new, simple yet theoretically and practically effective compression technique: natural compression (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
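The randomized rounding described above can be sketched as follows. This is an illustrative reading of the NC rule — round each entry to one of its two nearest powers of two, with probabilities chosen so the result is unbiased — not the authors' reference implementation:

```python
import numpy as np

def natural_compression(x, rng=np.random.default_rng()):
    """Round each nonzero entry of x to one of its two nearest powers
    of two, randomly, so that the compressed value is unbiased."""
    sign = np.sign(x)
    a = np.abs(x)
    out = np.zeros_like(x)
    nz = a > 0
    lo = 2.0 ** np.floor(np.log2(a[nz]))  # nearest power of two below |x|
    hi = 2.0 * lo                         # nearest power of two above |x|
    p_up = (a[nz] - lo) / lo              # P(round up); gives E[out] = x
    up = rng.random(p_up.shape) < p_up
    out[nz] = sign[nz] * np.where(up, hi, lo)
    return out
```

Unbiasedness follows since lo·(1 − p_up) + hi·p_up = lo + (|x| − lo) = |x|; in hardware terms, zeroing the mantissa of a float realizes the rounding-down branch "naturally".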
Federated Learning: Strategies for Improving Communication Efficiency
Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant, and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, which enables large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitates distributed training on mobile devices.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
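A minimal sketch of QSGD-style stochastic quantization with s levels, assuming the common formulation in which each entry is quantized relative to the vector's Euclidean norm (an illustrative reading, not the paper's exact encoding or bit packing):

```python
import numpy as np

def qsgd_quantize(v, s, rng=np.random.default_rng()):
    """QSGD-style stochastic quantization: map each entry to one of
    s+1 levels of the vector norm, rounding randomly so the output
    is unbiased in expectation."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm * s       # position of |v_i| in [0, s]
    lower = np.floor(ratio)
    p_up = ratio - lower               # round up with this probability
    level = lower + (rng.random(v.shape) < p_up)
    return np.sign(v) * norm * level / s
```

Only the norm, the signs, and the integer levels need to be transmitted, which is where the bit savings come from; larger s trades bandwidth for lower variance.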
DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
This work provides a detailed analysis of this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server. The method admits three very nice properties: it is compatible with an arbitrary compression technique, it admits an improved convergence rate, and it achieves linear speedup with respect to the number of workers.
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
A new low-rank gradient compressor based on power iteration that can compress gradients rapidly, efficiently aggregate the compressed gradients using all-reduce, and achieve test performance on par with SGD is proposed.
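The power-iteration compressor can be sketched for the rank-1 case. The warm-started factor q and the note about all-reducing the factors reflect PowerSGD's general scheme, though this toy version omits the orthogonalization needed for higher ranks and the error-feedback mechanism:

```python
import numpy as np

def rank1_compress(M, q, n_iter=1):
    """One power-iteration step giving a rank-1 approximation M ≈ p q^T.
    q is reused from the previous round (warm start), so a single
    iteration per step typically suffices."""
    for _ in range(n_iter):
        p = M @ q
        p /= np.linalg.norm(p) + 1e-12  # normalize (rank 1: no Gram-Schmidt needed)
        q = M.T @ p
    return p, q  # workers all-reduce the small factors p and q, never M itself
```

For a gradient matrix that is exactly rank 1, a single iteration recovers it; in general the factors are far smaller than M, which is what makes the all-reduce cheap.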
Distributed learning with compressed gradients
Asynchronous computation and gradient compression have emerged as two key techniques for achieving scalability in distributed optimization for large-scale machine learning. This paper presents a…
Distributed Learning with Compressed Gradient Differences
This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences, and performs a theoretical analysis in the strongly convex and nonconvex settings showing that its rates are superior to existing rates.
On Biased Compression for Distributed Learning
It is shown for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings, and a new highly performing biased compressor is proposed: a combination of Top-k and natural dithering, which in the authors' experiments outperforms all other compression techniques.