rTop-k: A Statistical Estimation Approach to Distributed SGD

@article{Barnes2020rTopkAS,
  title={rTop-k: A Statistical Estimation Approach to Distributed SGD},
  author={Leighton Pate Barnes and Huseyin A. Inan and Berivan Isik and Ayfer {\"O}zg{\"u}r},
  journal={IEEE Journal on Selected Areas in Information Theory},
  year={2020},
  volume={1},
  pages={897-907}
}
The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-$k$ and random-$k$ …
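As a hedged illustration of the sparsification schemes named in the abstract, the following NumPy sketch implements top-$k$, random-$k$, and a two-stage rTop-k-style selection (subsample r coordinates uniformly at random, then keep the largest-magnitude k among them). The parameter choices and the unbiasedness scaling for random-$k$ are illustrative; the paper's exact estimator and its analysis are not reproduced here.

```python
import numpy as np

def top_k(g: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of the gradient, zero the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def random_k(g: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Keep k uniformly random entries, rescaled by d/k to stay unbiased."""
    out = np.zeros_like(g)
    idx = rng.choice(g.size, size=k, replace=False)
    out[idx] = g[idx] * (g.size / k)
    return out

def r_top_k(g: np.ndarray, r: int, k: int, rng: np.random.Generator) -> np.ndarray:
    """Two-stage selection: subsample r coordinates at random, then keep the
    top-k magnitudes within that subsample (k <= r <= d)."""
    out = np.zeros_like(g)
    subset = rng.choice(g.size, size=r, replace=False)
    chosen = subset[np.argpartition(np.abs(g[subset]), -k)[-k:]]
    out[chosen] = g[chosen]
    return out

rng = np.random.default_rng(0)
g = rng.standard_normal(1000)                              # stand-in for a worker's gradient
print(np.count_nonzero(r_top_k(g, r=100, k=10, rng=rng)))  # -> 10
```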

Citations

Distributed Sparse SGD with Majority Voting

A novel majority voting based sparse communication strategy is introduced, in which the workers first seek a consensus on the structure of the sparse representation, which provides a significant reduction in the communication load and allows using the same sparsity level in both communication directions.
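A minimal sketch of the consensus idea summarized above, assuming each worker proposes its local top-k support and the coordinates chosen by a majority of workers form the shared sparsity pattern; the voting rule, sparsity level, and aggregation below are illustrative rather than the paper's exact protocol.

```python
import numpy as np

def top_k_mask(g: np.ndarray, k: int) -> np.ndarray:
    """Boolean mask of a worker's locally preferred top-k coordinates."""
    mask = np.zeros(g.size, dtype=bool)
    mask[np.argpartition(np.abs(g), -k)[-k:]] = True
    return mask

def majority_vote(masks: list) -> np.ndarray:
    """Consensus support: keep coordinates proposed by more than half the workers."""
    votes = np.sum(masks, axis=0)
    return votes > len(masks) / 2

rng = np.random.default_rng(1)
base = rng.standard_normal(1000)                            # common gradient direction
grads = [base + 0.3 * rng.standard_normal(1000) for _ in range(8)]  # noisy per-worker views
consensus = majority_vote([top_k_mask(g, k=50) for g in grads])
# Both uplink and downlink now carry only the agreed coordinates.
aggregated = np.mean([g[consensus] for g in grads], axis=0)
print(consensus.sum(), aggregated.shape)
```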

Breaking the Communication-Privacy-Accuracy Trilemma

Novel encoding and decoding mechanisms that simultaneously achieve optimal privacy and communication efficiency in various canonical settings are developed, and it is demonstrated that intelligent encoding under joint privacy and communication constraints can yield performance that matches the optimal accuracy achievable under either constraint alone.

Successive Pruning for Model Compression via Rate Distortion Theory

This work studies neural network (NN) compression from an information-theoretic perspective, shows that rate-distortion theory suggests pruning as the way to achieve the theoretical limits of NN compression, and provides an end-to-end compression pipeline involving a novel pruning strategy.

Sparse Random Networks for Communication-Efficient Federated Learning

This work proposes a radically different approach to federated learning that does not update the weights at all; instead, it freezes them at their initial random values and learns how to sparsify the random network for the best performance.
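A toy illustration, under assumed seed, size, and density values, of why the freeze-and-sparsify idea summarized above is communication-efficient: every party regenerates identical frozen random weights from a shared seed, so a round only needs to carry a binary mask rather than floating-point weights. Mask training itself is omitted here.

```python
import numpy as np

SEED, D = 1234, 100_000            # illustrative shared seed and model size

def frozen_weights() -> np.ndarray:
    """Every client and the server regenerate the same frozen random weights."""
    return np.random.default_rng(SEED).standard_normal(D)

w = frozen_weights()
mask = np.zeros(D, dtype=bool)     # clients would learn this mask; training omitted
mask[:D // 100] = True             # e.g. a 1%-dense subnetwork
effective_model = w * mask         # the subnetwork actually used for inference

uplink_bits_mask = mask.size       # 1 bit per parameter for the binary mask
uplink_bits_dense = 32 * w.size    # float32 weights for a dense update
print(uplink_bits_mask, uplink_bits_dense)   # 100000 vs 3200000
```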

DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks

DRAGONN is proposed, a randomized hashing algorithm for gradient sparsification in distributed training that can significantly reduce compression time by up to 70% compared to state-of-the-art gradient sparsification approaches, and achieve up to 3…

Over-the-Air Statistical Estimation of Sparse Models

This work shows that analog schemes that design estimation and communication jointly can efficiently exploit the inherent sparsity in high-dimensional models and observations, and provide drastic improvements over digital schemes that separate source and channel coding in this context.

Model Segmentation for Storage Efficient Private Federated Learning with Top r Sparsification

Two schemes with different properties are presented that use MDS coded storage along with a model segmentation mechanism to perform private FL with top r sparsification, reducing the storage cost at the expense of a controllable amount of information leakage.

Rate-Privacy-Storage Tradeoff in Federated Learning with Top r Sparsification

The general trade-off between the communication cost, storage cost, and information leakage in private FL with top r sparsification is characterized via two proposed schemes.

ResFed: Communication Efficient Federated Learning by Transmitting Deep Compressed Residuals

A residual-based federated learning framework (ResFed) is introduced to address the communication bottleneck that growing model sizes create for federated learning deployments in wireless networks.

FedLTN: Federated Learning for Sparse and Personalized Lottery Ticket Networks

FedLTN is proposed, a novel approach motivated by the well-known Lottery Ticket Hypothesis to learn sparse and personalized lottery ticket networks (LTNs) for communication-efficient and personalized FL under non-IID (not independent and identically distributed) data settings.

References

SHOWING 1-10 OF 51 REFERENCES

Sparsified SGD with Memory

This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
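A minimal sketch, assuming a top-k compressor and a toy quadratic objective, of the error-compensation (memory) mechanism described above: whatever the compressor drops is kept locally and added back before the next compression. The learning rate, sparsity level, and objective are illustrative.

```python
import numpy as np

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd_step(w, grad, memory, lr=0.1, k=100):
    """One step of sparsified SGD with error feedback."""
    corrected = lr * grad + memory       # add back what was dropped earlier
    update = top_k(corrected, k)         # only this sparse vector is communicated
    memory = corrected - update          # remember the part that was not sent
    return w - update, memory

rng = np.random.default_rng(2)
w, memory = np.zeros(1000), np.zeros(1000)
for _ in range(100):
    grad = w - 1.0 + 0.01 * rng.standard_normal(w.size)   # noisy grad of 0.5*||w - 1||^2
    w, memory = ef_sgd_step(w, grad, memory)
print(np.linalg.norm(w - 1.0))   # much smaller than the initial distance sqrt(1000) ~ 31.6
```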

Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations

This paper proposes Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients, and demonstrates that it converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers.
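A compact sketch of the composition described above, under illustrative operator choices: top-k sparsification followed by a simple scaled-sign quantizer, several local SGD steps between synchronizations, and an error-compensation memory. The paper's exact compressors and update rule are not reproduced.

```python
import numpy as np

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def scaled_sign(v):
    """A 1-bit-per-surviving-entry quantizer: signs plus a single shared scale."""
    scale = np.abs(v).sum() / max(np.count_nonzero(v), 1)
    return scale * np.sign(v)

def qsparse_compress(v, k):
    """Sparsify first, then quantize the surviving entries."""
    return scaled_sign(top_k(v, k))

def local_steps_then_sync(w, local_grads, memory, k=100, lr=0.1):
    """Several local SGD steps without communication, then one compressed,
    error-compensated synchronization of the accumulated local change."""
    w_local = w.copy()
    for g in local_grads:                 # local computations
        w_local -= lr * g
    delta = (w - w_local) + memory        # accumulated change + leftover error
    sent = qsparse_compress(delta, k)     # what actually goes over the network
    memory = delta - sent                 # keep the compression error for later
    return w - sent, memory

rng = np.random.default_rng(3)
w, memory = np.zeros(1000), np.zeros(1000)
local_grads = [w - 1.0 + 0.01 * rng.standard_normal(1000) for _ in range(5)]
w, memory = local_steps_then_sync(w, local_grads, memory)
```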

On Communication Cost of Distributed Statistical Estimation and Dimensionality

It is conjectured that the tradeoff between communication and squared loss demonstrated by this protocol is essentially optimal up to a logarithmic factor, and a study of strong lower bounds in the general setting is initiated.

Communication lower bounds for statistical estimation problems via a distributed data processing inequality

A distributed data processing inequality is proved, as a generalization of usual data processing inequalities, which might be of independent interest and useful for other problems.

Communication-Efficient Distributed Learning of Discrete Probability Distributions

This work designs distributed learning algorithms that achieve significantly better communication guarantees than the naive ones, and obtain tight upper and lower bounds in several regimes of this basic estimation task.

Geometric Lower Bounds for Distributed Parameter Estimation Under Communication Constraints

This work circumvents the need for the strong data processing inequalities used in prior work and develops a geometric approach that builds on a new representation of the communication constraint, allowing existing results to be strengthened and generalized with simpler and more transparent proofs.

Gradient Sparsification for Communication-Efficient Distributed Optimization

This paper proposes a convex optimization formulation to minimize the coding length of stochastic gradients, and experiments on regularized logistic regression, support vector machines, and convolutional neural networks validate the proposed approaches.
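As a hedged illustration of the primitive behind this formulation, the sketch below keeps each gradient coordinate with a magnitude-dependent probability and rescales survivors so the compressed gradient remains unbiased; in the paper the keep-probabilities come from solving a convex program that trades sparsity against variance, whereas the magnitude-proportional choice here is only a simple stand-in.

```python
import numpy as np

def unbiased_random_sparsify(g: np.ndarray, budget: int, rng) -> np.ndarray:
    """Keep coordinate i with probability p_i (about `budget` survivors in
    expectation) and rescale by 1/p_i so that E[output] = g."""
    p = np.minimum(1.0, budget * np.abs(g) / np.abs(g).sum())
    keep = rng.random(g.size) < p
    out = np.zeros_like(g)
    out[keep] = g[keep] / p[keep]
    return out

rng = np.random.default_rng(4)
g = rng.standard_normal(1000)
sparse_g = unbiased_random_sparsify(g, budget=50, rng=rng)
print(np.count_nonzero(sparse_g))   # roughly 50 nonzeros, far fewer values to encode
```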

Federated Learning: Strategies for Improving Communication Efficiency

Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
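A simplified sketch of the random-mask flavor of structured updates, assuming the mask is derived from a seed shared between client and server so only the masked values (plus the seed) travel on the uplink. Here the mask is applied to a precomputed update for illustration, whereas the paper learns the update directly inside the restricted space; the sketched-update variants (quantization, random rotations, subsampling) are not shown.

```python
import numpy as np

def random_mask(d: int, density: float, seed: int) -> np.ndarray:
    """Client and server regenerate the identical mask from a shared seed."""
    return np.random.default_rng(seed).random(d) < density

def client_upload(local_delta: np.ndarray, seed: int, density: float = 0.05):
    """Client side: send only the update entries selected by the random mask."""
    mask = random_mask(local_delta.size, density, seed)
    return local_delta[mask]                 # ~density * d floats on the uplink

def server_apply(w: np.ndarray, values: np.ndarray, seed: int, density: float = 0.05):
    """Server side: rebuild the mask from the seed and scatter the values."""
    mask = random_mask(w.size, density, seed)
    w = w.copy()
    w[mask] += values
    return w

w = np.zeros(10_000)
delta = np.random.default_rng(5).standard_normal(10_000)
w = server_apply(w, client_upload(delta, seed=42), seed=42)
```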

Fisher Information for Distributed Estimation under a Blackboard Communication Protocol

We consider the problem of learning high-dimensional discrete distributions and structured (e.g. Gaussian) distributions in distributed networks, where each node in the network observes an …

Lower Bounds for Learning Distributions under Communication Constraints via Fisher Information

We consider the problem of learning high-dimensional, nonparametric and structured (e.g. Gaussian) distributions in distributed networks, where each node in the network observes an independent sample
...