Corpus ID: 231933673

MARINA: Faster Non-Convex Distributed Learning with Compression

@article{Gorbunov2021MARINAFN,
  title={MARINA: Faster Non-Convex Distributed Learning with Compression},
  author={Eduard A. Gorbunov and Konstantin Burlachenko and Zhize Li and Peter Richt{\'a}rik},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.07845}
}
We develop and analyze MARINA: a new communication efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences that is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator… 
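To make the gradient-difference idea concrete, here is a minimal single-process sketch of a MARINA-style step, assuming n workers exposed as gradient callables `local_grads[i](x)` and an unbiased rand-k sparsifier; the function names, the compressor choice, and the hyperparameters are illustrative assumptions, not the authors' code.

```python
# Minimal single-process simulation of a MARINA-style update (illustrative sketch).
import numpy as np

def rand_k(v, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def marina_style_run(x0, local_grads, gamma=0.1, p=0.1, k=10, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    prev_grads = [grad(x) for grad in local_grads]   # one initial full-gradient round
    g = np.mean(prev_grads, axis=0)
    for _ in range(steps):
        x = x - gamma * g                            # all workers take the same step
        new_grads = [grad(x) for grad in local_grads]
        if rng.random() < p:
            # rare synchronization round: every worker sends its full gradient
            g = np.mean(new_grads, axis=0)
        else:
            # usual round: workers send only compressed gradient *differences*
            g = g + np.mean(
                [rand_k(new - old, k, rng) for new, old in zip(new_grads, prev_grads)],
                axis=0,
            )
        prev_grads = new_grads
    return x
```

The expensive full-gradient exchange happens only with small probability p; in all other rounds each worker transmits a compressed difference, which is what keeps the per-round communication low in this sketch.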

Citations

DASHA: Distributed Nonconvex Optimization with Communication Compression, Optimal Oracle Complexity, and No Client Synchronization
TLDR
DASHA improves the theoretical oracle and communication complexity of the previous state-of-the-art method MARINA, and the theory is corroborated in practice: experiments with nonconvex classification and the training of deep learning models show a significant improvement.
Faster Rates for Compressed Federated Learning with Client-Variance Reduction
TLDR
Both COFIG and FRECON avoid communicating with all clients and provide the first (or faster) convergence results for convex and nonconvex federated learning, whereas previous works either require communication with all clients or obtain worse convergence results.
EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback
TLDR
It is proved that EF21 enjoys a fast O(1/T) convergence rate for smooth nonconvex problems, beating the previous bound of O(1/T^{2/3}), which was shown under a strong bounded-gradients assumption (a rough sketch of the EF21-style update appears after this list).
BEER: Fast O(1/T) Rate for Decentralized Nonconvex Optimization with Communication Compression
TLDR
This paper proposes BEER, which combines communication compression with gradient tracking, and shows that it converges at a rate of O(1/T), faster than the state-of-the-art rate and matching the rate without compression, even under arbitrary data heterogeneity.
FL_PyTorch: optimization research simulator for federated learning
TLDR
FL_PyTorch is a suite of open-source software written in Python that builds on top of one of the most popular research deep learning (DL) frameworks, PyTorch, to enable fast development, prototyping, and experimentation with new and existing FL optimization algorithms.
3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation
We propose and study a new class of gradient communication mechanisms for communication-efficient training, three point compressors (3PC), as well as efficient distributed nonconvex optimization…
Decentralized Optimization Over Noisy, Rate-Constrained Networks: Achieving Consensus by Communicating Differences
TLDR
This paper proposes a novel algorithm, Decentralized Lazy Mirror Descent with Differential Exchanges (DLMD-DiffEx), which guarantees convergence of the local estimates to the optimal solution under the given communication constraints, and investigates the performance of DLMD-DiffEx both theoretically and through numerical evaluations on synthetic data and MNIST.
FedShuffle: Recipes for Better Use of Local Work in Federated Learning
TLDR
This work presents a comprehensive theoretical analysis of FedShuffle and shows, both theoretically and empirically, that the approach does not suffer from the objective function mismatch present in FL methods that assume homogeneous updates in heterogeneous FL setups.
Privacy-Aware Compression for Federated Data Analysis
TLDR
This work proposes a mechanism for transmitting a single real number with optimal variance under certain conditions and shows how to extend it to metric differential privacy for location-privacy use cases and to vectors for application to federated learning.
Server-Side Stepsizes and Sampling Without Replacement Provably Help in Federated Optimization
TLDR
The results are the first to show that the widely popular heuristic of scaling client updates with an extra parameter is very useful in the context of Federated Averaging with local passes over the client data, and the first to show that local steps provably help to overcome the communication bottleneck.
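As referenced in the EF21 entry above, the following is a minimal sketch of an EF21-style error-feedback step, in which each worker compresses the residual between its fresh gradient and its running estimate. The Top-k compressor, the callables `local_grads[i](x)`, and all names are illustrative assumptions, not the paper's implementation.

```python
# Minimal single-process simulation of an EF21-style error-feedback update (illustrative sketch).
import numpy as np

def top_k(v, k):
    """Biased Top-k compressor: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_style_run(x0, local_grads, gamma=0.1, k=10, steps=100):
    n = len(local_grads)
    x = x0.copy()
    g_i = [grad(x) for grad in local_grads]   # per-worker gradient estimates
    g = np.mean(g_i, axis=0)                  # server-side aggregate of the estimates
    for _ in range(steps):
        x = x - gamma * g
        for i, grad in enumerate(local_grads):
            c = top_k(grad(x) - g_i[i], k)    # compress the residual, not the gradient
            g_i[i] = g_i[i] + c               # worker moves its estimate toward the truth
            g = g + c / n                     # server applies the same compressed update
    return x
```

The key design choice in this sketch is that the (possibly biased) compressor acts on the residual, so each worker's estimate only ever moves toward its true gradient; no bounded-gradients assumption is needed in this construction.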

References

SHOWING 1-10 OF 49 REFERENCES
Federated Learning with Compression: Unified Analysis and Sharp Guarantees
TLDR
This work proposes a set of algorithms with periodic compressed (quantized or sparsified) communication, analyzes their convergence properties in both homogeneous and heterogeneous local data distribution settings, and introduces a scheme to mitigate data heterogeneity.
Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations
TLDR
This paper proposes the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation, along with error compensation achieved by keeping track of the difference between the true and compressed gradients, and demonstrates that it converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers.
LIBSVM: A library for support vector machines
TLDR
Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
A Unified Theory of Decentralized SGD with Changing Topology and Local Updates
TLDR
This paper introduces a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and have been developed separately in various communities.
Decentralized Deep Learning with Arbitrary Communication Compression
TLDR
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization
TLDR
The results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the theoretical results and confirming the practical superiority of PAGE.
A Unified Analysis of Stochastic Gradient Methods for Nonconvex Federated Optimization
TLDR
This paper provides a single convergence analysis for all methods that satisfy the proposed unified assumption on the second moment of the stochastic gradient, thereby offering a unified understanding of SGD variants in the nonconvex regime instead of relying on dedicated analyses of each variant.
On Biased Compression for Distributed Learning
TLDR
It is shown for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings, and a new, highly performing biased compressor is proposed, a combination of Top-k and natural dithering, which in the authors' experiments outperforms all other compression techniques.
Distributed Learning with Compressed Gradient Differences
TLDR
This work proposes a new distributed learning method, DIANA, which resolves the issues of earlier compression schemes via compression of gradient differences; a theoretical analysis in the strongly convex and nonconvex settings shows that its rates are superior to existing rates (a rough sketch of a DIANA-style update appears after this list).
Federated Learning: Strategies for Improving Communication Efficiency
TLDR
Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
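As referenced in the DIANA entry above, here is a minimal sketch of a DIANA-style step, in which each worker maintains a gradient shift and compresses the difference between its current gradient and that shift. The rand-k compressor, the callables `local_grads[i](x)`, and the shift stepsize `alpha` are illustrative assumptions rather than the authors' code.

```python
# Minimal single-process simulation of a DIANA-style update (illustrative sketch).
import numpy as np

def rand_k(v, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def diana_style_run(x0, local_grads, gamma=0.1, alpha=0.05, k=10, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(local_grads)
    x = x0.copy()
    h = [np.zeros_like(x0) for _ in range(n)]   # per-worker gradient shifts
    for _ in range(steps):
        # each worker compresses the difference between its gradient and its shift
        deltas = [rand_k(local_grads[i](x) - h[i], k, rng) for i in range(n)]
        g = np.mean([h[i] + deltas[i] for i in range(n)], axis=0)  # server's estimate
        h = [h[i] + alpha * deltas[i] for i in range(n)]           # shifts track gradients
        x = x - gamma * g
    return x
```

In this sketch the compressed quantity is the difference to a slowly updated shift rather than to the previous iterate's gradient, which is the contrast with the MARINA-style update sketched after the abstract above.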