Corpus ID: 219956317

A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

@article{Horvath2021ABA,
  title={A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning},
  author={Samuel Horv{\'a}th and Peter Richt{\'a}rik},
  journal={ArXiv},
  year={2021},
  volume={abs/2006.11077}
}
Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the…
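For orientation, here is a minimal NumPy sketch of the error-feedback (EF) loop the abstract refers to, using Top-k as a stand-in biased compressor. The function names, step size and sparsity level are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef_sgd_step(w, grads, memories, lr=0.1, k=10):
    """One distributed SGD step with error feedback (illustrative sketch).

    grads[i] is worker i's stochastic gradient and memories[i] its local
    error accumulator: each worker compresses step + memory, keeps the
    compression error for the next round, and the server averages the
    compressed messages.
    """
    messages = []
    for i, g in enumerate(grads):
        corrected = lr * g + memories[i]     # add back past compression error
        msg = top_k(corrected, k)            # compressed uplink message
        memories[i] = corrected - msg        # store what was not transmitted
        messages.append(msg)
    return w - np.mean(messages, axis=0)     # server aggregates and applies the update
```

In a training loop this would be called once per iteration with the workers' current stochastic gradients and persistent per-worker memory vectors.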
Citations

Preserved central model for faster bidirectional compression in distributed settings
TLDR: A new approach to tackling communication constraints in a distributed learning problem with a central server is developed, and a new algorithm is proposed that performs bidirectional compression while achieving the same convergence rate as algorithms using only uplink compression.
FedSKETCH: Communication-Efficient and Private Federated Learning via Sketching
TLDR: This work introduces the FedSKETCH and FedSKETCHGATE algorithms to address both challenges in federated learning jointly, intended for the homogeneous and heterogeneous data distribution settings respectively.
Optimal Client Sampling for Federated Learning
TLDR: A novel client subsampling scheme is proposed in which the number of clients allowed to communicate their updates back to the master node is restricted, together with a simple algorithm that approximates the optimal formula for client participation and requires only secure aggregation, thus not compromising client privacy.
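A rough sketch of how norm-proportional client subsampling with inverse-probability reweighting could look. The exact capping procedure in the paper is more involved, so the budget-scaled probabilities below are only an illustrative approximation.

```python
import numpy as np

def sample_clients(update_norms, budget, rng=None):
    """Sample roughly `budget` clients per round with probability proportional
    to the norm of their local update, returning inverse-probability weights
    so the aggregated update stays unbiased (illustrative approximation)."""
    rng = rng or np.random.default_rng()
    norms = np.asarray(update_norms, dtype=float)
    probs = np.minimum(1.0, budget * norms / norms.sum())  # crude cap at 1
    mask = rng.random(len(norms)) < probs
    selected = np.flatnonzero(mask)
    weights = 1.0 / probs[selected]          # reweight the clients that report back
    return selected, weights
```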
FjORD: Fair and Accurate Federated Learning under heterogeneous targets with Ordered Dropout
TLDR: This work introduces Ordered Dropout, a mechanism that achieves an ordered, nested representation of knowledge in neural networks and enables the extraction of lower-footprint submodels without the need for retraining, applied to FL in a framework called FjORD.
Leveraging Spatial and Temporal Correlations in Sparsified Mean Estimation
We study the problem of estimating at a central server the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are high-dimensional, the communication…
Rethinking gradient sparsification as total error minimization
TLDR: This work identifies that the total error, the sum of the compression errors over all iterations, encapsulates sparsification throughout training, and proposes a communication complexity model that minimizes the total error under a communication budget for the entire training run.
A Field Guide to Federated Optimization
TLDR: This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance.
Compressed Communication for Distributed Training: Adaptive Methods and System
TLDR: This paper introduces a novel adaptive gradient method with gradient compression that has a convergence rate of O(1/√T) for non-convex problems, and develops a scalable system called BytePS-Compress for two-way compression, where gradients are compressed in both directions between workers and parameter servers.
Compressed Gradient Tracking Methods for Decentralized Optimization with Linear Convergence
TLDR: This paper proposes a novel compressed gradient tracking algorithm (C-GT) that combines the gradient tracking technique with communication compression, and shows that C-GT is compatible with a general class of compression operators unifying both unbiased and biased compressors.
EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback
TLDR: Six practical extensions of EF21 are proposed, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum and bidirectional compression.

References

Showing 1-10 of 60 references
Natural Compression for Distributed Deep Learning
TLDR: This work introduces a new, simple yet theoretically and practically effective compression technique: natural compression (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
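A hedged NumPy sketch of the randomized power-of-two rounding described above; the bit-level mantissa trick from the paper is skipped and the rounding is done arithmetically instead.

```python
import numpy as np

def natural_compression(x, rng=None):
    """Randomized rounding of each entry to the nearest power of two.

    Each nonzero |x_i| in [2^p, 2^(p+1)] is rounded up to 2^(p+1) with
    probability (|x_i| - 2^p) / 2^p and down to 2^p otherwise, which keeps
    the compressor unbiased; the sign is preserved.
    """
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nz = x != 0
    mag = np.abs(x[nz])
    low = 2.0 ** np.floor(np.log2(mag))
    prob_up = (mag - low) / low              # chance of rounding up
    rounded = low * np.where(rng.random(mag.shape) < prob_up, 2.0, 1.0)
    out[nz] = np.sign(x[nz]) * rounded
    return out
```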
On Biased Compression for Distributed Learning
TLDR: It is shown for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings, and a new highly performing biased compressor, a combination of Top-k and natural dithering, is proposed, which in the authors' experiments outperforms all other compression techniques.
QSGD: Randomized Quantization for Communication-Optimal Stochastic Gradient Descent
TLDR: Quantized SGD is proposed, a family of compression schemes which allow the compression of gradient updates at each node while guaranteeing convergence under standard assumptions, and which let the user trade off compression and convergence time.
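A small sketch of stochastic quantization in the spirit of QSGD with `s` quantization levels; packing the result into a compact bit stream, which is where the communication savings come from, is omitted.

```python
import numpy as np

def qsgd_quantize(x, s=4, rng=None):
    """Map each entry to one of s+1 levels of |x_i| / ||x||_2 by randomized
    rounding, so the quantizer is unbiased and only the norm, the signs and
    small integers need to be communicated."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0:
        return np.zeros_like(x)
    scaled = s * np.abs(x) / norm            # in [0, s]
    lower = np.floor(scaled)
    level = lower + (rng.random(x.shape) < (scaled - lower))  # stochastic rounding
    return norm * np.sign(x) * level / s
```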
3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning
TLDR: 3LC is presented, a lossy compression scheme for state change traffic that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality.
Sparsified SGD with Memory
TLDR: This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
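Random-k, the other sparsifier mentioned above, can be written analogously to the `top_k` helper sketched earlier and plugged into the same error-compensation loop; this is an illustrative helper, not code from the paper.

```python
import numpy as np

def random_k(x, k, rng=None):
    """Keep k uniformly chosen coordinates of x and zero out the rest."""
    rng = rng or np.random.default_rng()
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out
```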
99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it
TLDR: A new variant of parallel block coordinate descent based on independent sparsification of the local gradient estimates before communication is developed, supported by extensive numerical experiments that demonstrate an almost perfect match with the theory on a number of synthetic and real datasets.
On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning
TLDR: It is proved that layer-wise compression is, in theory, better, because the convergence rate is upper bounded by that of entire-model compression for a wide range of biased and unbiased compression methods.
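To make the comparison concrete: a layer-wise scheme applies a compressor to each layer's gradient separately, whereas entire-model compression flattens everything into one vector first. A minimal sketch, with `compressor` standing in for any operator like the ones sketched above:

```python
import numpy as np

def compress_layerwise(layer_grads, compressor):
    """Apply `compressor` independently to each layer's gradient."""
    return [compressor(g) for g in layer_grads]

def compress_entire_model(layer_grads, compressor):
    """Flatten all gradients into one vector, compress once, then split back."""
    flat = np.concatenate([g.ravel() for g in layer_grads])
    compressed = compressor(flat)
    out, start = [], 0
    for g in layer_grads:
        out.append(compressed[start:start + g.size].reshape(g.shape))
        start += g.size
    return out
```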
Convex Optimization using Sparsified Stochastic Gradient Descent with Memory
TLDR: A sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration, which outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
TLDR: It is proved that the algorithm EF-SGD with an arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions, and thus EF-SGD achieves gradient compression for free.
Distributed Learning with Compressed Gradient Differences
TLDR: This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences, performs a theoretical analysis in the strongly convex and nonconvex settings, and shows that its rates are superior to existing rates.
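A rough sketch of the gradient-difference mechanism behind DIANA: each worker keeps a shift h_i, compresses g_i - h_i, and moves the shift toward its gradient. The step size `lr` and shift rate `alpha` are illustrative, and `compressor` can be any unbiased operator such as the QSGD or natural-compression sketches above.

```python
import numpy as np

def diana_step(w, grads, shifts, compressor, lr=0.1, alpha=0.5):
    """One step in the spirit of DIANA: compress gradient *differences*
    rather than gradients, so messages shrink as the shifts learn."""
    h_avg = np.mean(shifts, axis=0)                     # server's average shift
    messages = [compressor(g - h) for g, h in zip(grads, shifts)]
    avg_msg = np.mean(messages, axis=0)
    g_hat = h_avg + avg_msg                             # gradient estimate at the server
    for i, m in enumerate(messages):
        shifts[i] = shifts[i] + alpha * m               # workers update their local shifts
    return w - lr * g_hat
```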