Corpus ID: 220962352

PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning

@article{Vogels2020PowerGossipPL,
  title={PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning},
  author={Thijs Vogels and Sai Praneeth Reddy Karimireddy and Martin Jaggi},
  journal={ArXiv},
  year={2020},
  volume={abs/2008.01425}
}
Lossy gradient compression has become a practical tool to overcome the communication bottleneck in centrally coordinated distributed training of machine learning models. However, algorithms for decentralized training with compressed communication over arbitrary connected networks have been more complicated, requiring additional memory and hyperparameters. We introduce a simple algorithm that directly compresses the model differences between neighboring workers using low-rank linear compressors…
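To make the idea concrete, below is a minimal sketch of the kind of low-rank linear compressor the abstract describes, assuming a single power-iteration step applied to a matrix-shaped parameter difference between two neighbors; the function names, the rank, and the warm-started factor are illustrative choices, not the paper's exact algorithm.

```python
import torch

def low_rank_compress(delta: torch.Tensor, q: torch.Tensor):
    """One power-iteration step on a matrix-shaped parameter difference.

    delta: (n, m) difference between a worker's parameters and a neighbor's.
    q:     (m, r) reusable right factor (e.g., warm-started from the last round).
    Returns the low-rank factors (p, q_new); only these small matrices would be
    exchanged between neighbors instead of the full difference matrix.
    """
    p = delta @ q                  # (n, r) left factor
    p, _ = torch.linalg.qr(p)      # orthonormalize for a stable iteration
    q_new = delta.T @ p            # (m, r) refreshed right factor
    return p, q_new

def low_rank_decompress(p: torch.Tensor, q_new: torch.Tensor):
    """Reconstruct the rank-r approximation of the difference."""
    return p @ q_new.T

# Toy usage: approximate a 256x128 difference with rank 2.
delta = torch.randn(256, 128)
q = torch.randn(128, 2)
p, q_new = low_rank_compress(delta, q)
approx = low_rank_decompress(p, q_new)
```

In this sketch only the small factors are communicated, which is the source of the bandwidth savings; how the factors are exchanged and reused across rounds is where the paper's actual algorithm comes in.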
1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
TLDR
1-bit Adam is proposed, which reduces the communication volume by up to 5×, offers much better scalability, and provides the same sample-wise convergence speed as uncompressed Adam.
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
TLDR
A new communication-efficient algorithm, 1-bit LAMB, is designed; it introduces a novel way to support adaptive layerwise learning rates even when communication is compressed, together with a new system implementation of compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance over the existing MPI-based implementation.
A Flexible Framework for Communication-Efficient Machine Learning
TLDR
A flexible framework is proposed which adapts the compression level to the true gradient at each iteration, maximizing the improvement in the objective function achieved per communicated bit.
Consensus Control for Decentralized Deep Learning
TLDR
It is shown in theory that when the consensus distance during training stays below a critical quantity, decentralized training converges as fast as its centralized counterpart; the accompanying empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop.
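For reference, the consensus distance mentioned above is commonly defined as the root-mean-square deviation of the workers' models from their average; a standard formulation (notation assumed here, the paper's exact definition may differ) is
\[
\Xi_t \;=\; \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl\|x_i^{(t)} - \bar{x}^{(t)}\bigr\|_2^2},
\qquad
\bar{x}^{(t)} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i^{(t)},
\]
where $x_i^{(t)}$ is worker $i$'s model at step $t$ and $n$ is the number of workers.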
Cross-Gradient Aggregation for Decentralized Learning from Non-IID data
TLDR
This work proposes Cross-Gradient Aggregation (CGA), a novel decentralized learning algorithm in which each agent aggregates cross-gradient information and updates its model using a projected gradient based on quadratic programming (QP), and theoretically analyzes the convergence characteristics of CGA.
DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation
TLDR
It is theoretically proved that the DataLens framework guarantees differential privacy for its generated data, an analysis of its convergence is provided, and DataLens is shown to significantly outperform other baseline DP generative models.
Distributed Online Learning for Joint Regret with Communication Constraints
TLDR
A comparator-adaptive algorithm is provided for this setting, meaning that the joint regret scales with the norm of the comparator ‖u‖; the algorithm attains worst-case optimal regret in the case that all agents communicate in every round.
ErrorCompensatedX: error compensation for variance reduced algorithms
TLDR
ErrorCompensatedX is proposed, which uses the compression error from the previous two steps to achieve the same asymptotic convergence rate as training without compression, and a unified theoretical analysis framework is provided for this class of variance-reduced algorithms, with or without error compensation.
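As a rough illustration of error compensation in general (a single-step residual rather than ErrorCompensatedX's use of the previous two steps, whose exact update is not given here), a worker can keep a running residual of what the compressor discarded and fold it back into the next message; the top-k compressor and all names below are placeholder choices.

```python
import torch

def topk_compress(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude entries (a simple lossy compressor)."""
    out = torch.zeros_like(x)
    idx = x.abs().topk(k).indices
    out[idx] = x[idx]
    return out

# Single-step error feedback: the residual `e` accumulates what compression dropped.
e = torch.zeros(1000)
for _ in range(10):
    g = torch.randn(1000)               # stand-in stochastic gradient for this step
    corrected = g + e                   # add back previously discarded signal
    msg = topk_compress(corrected, 50)  # what actually gets communicated
    e = corrected - msg                 # remember the new compression error
```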
Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
TLDR
This paper investigates and identifies the limitations of several decentralized optimization algorithms under different degrees of data heterogeneity, and proposes a novel momentum-based method to mitigate this decentralized training difficulty.
On Communication Compression for Distributed Optimization on Heterogeneous Data
TLDR
The results indicate that D-EF-SGD is much less affected than D-QSGD by non-iid data, but both methods can suffer a slowdown if data skewness is high.

References

SHOWING 1-10 OF 38 REFERENCES
DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
TLDR
This work provides a detailed analysis of this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server; the method has three very nice properties: it is compatible with an arbitrary compression technique, admits an improved convergence rate, and achieves linear speedup with respect to the number of workers.
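A condensed, single-worker sketch of the double-pass idea under the same placeholder top-k compressor: both the worker and the parameter server keep their own residuals, so the compression error of each pass is compensated. This is an illustrative reading of the mechanism, not the paper's exact algorithm.

```python
import torch

def topk(x: torch.Tensor, k: int) -> torch.Tensor:
    out = torch.zeros_like(x)
    idx = x.abs().topk(k).indices
    out[idx] = x[idx]
    return out

e_worker = torch.zeros(1000)   # residual kept on the worker
e_server = torch.zeros(1000)   # residual kept on the parameter server

for _ in range(10):
    g = torch.randn(1000)                  # stand-in stochastic gradient
    # Pass 1: worker -> server, error-compensated on the worker.
    up = topk(g + e_worker, 50)
    e_worker = (g + e_worker) - up
    # (In the multi-worker case the server would average `up` over workers here.)
    # Pass 2: server -> workers, error-compensated on the server.
    down = topk(up + e_server, 50)
    e_server = (up + e_server) - down
    # `down` is what every worker applies to its model.
```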
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
TLDR
This paper studies a D-PSGD algorithm and provides the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent.
GradZip: Gradient compression using alternating matrix factorization for large-scale deep learning, 2019
Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation
TLDR
A comprehensive survey of the most influential compressed communication methods for DNN training is presented, together with an intuitive classification (i.e., quantization, sparsification, hybrid, and low-rank) and a unified framework and API that allow for consistent and easy implementation of compressed communication on popular machine learning toolkits.
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
TLDR
A new low-rank gradient compressor based on power iteration is proposed that can compress gradients rapidly, efficiently aggregate the compressed gradients using all-reduce, and achieve test performance on par with SGD.
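To quantify why low rank helps: if a layer's gradient is reshaped into a matrix $M \in \mathbb{R}^{n \times m}$ and sent as rank-$r$ factors $P \in \mathbb{R}^{n \times r}$ and $Q \in \mathbb{R}^{m \times r}$, the per-layer traffic drops from $nm$ to $r(n+m)$ numbers; the concrete figures below are an illustrative calculation, not taken from the paper.
\[
M \approx P Q^\top, \qquad nm \;\longrightarrow\; r(n+m), \qquad
\text{e.g. } n=m=512,\; r=2:\;\; \frac{2\,(512+512)}{512\cdot 512} = \frac{2048}{262144} \approx 0.78\%.
\]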
Decentralized Deep Learning with Arbitrary Communication Compression
TLDR
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
TLDR
This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.
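For intuition, here is a single-process simulation of one compressed-gossip round in the spirit of CHOCO-SGD; the top-k compressor, the uniform mixing matrix, and the step sizes are placeholder choices, and the paper should be consulted for the exact update.

```python
import torch

n, d = 4, 100                      # 4 simulated workers, 100 parameters each
W = torch.full((n, n), 1.0 / n)    # mixing matrix of a fully connected topology (placeholder)
gamma, lr = 0.5, 0.1               # consensus step size and learning rate (placeholders)

x = torch.randn(n, d)              # private models
x_hat = torch.zeros(n, d)          # publicly known (compressed) copies of each model

def compress(v: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Toy top-k compressor standing in for an arbitrary compression operator."""
    out = torch.zeros_like(v)
    idx = v.abs().topk(k).indices
    out[idx] = v[idx]
    return out

for _ in range(5):
    grads = torch.randn(n, d)                  # stand-in stochastic gradients
    x = x - lr * grads                         # local SGD step on every worker
    for i in range(n):
        x_hat[i] += compress(x[i] - x_hat[i])  # exchange compressed differences
    x = x + gamma * (W @ x_hat - x_hat)        # gossip step toward neighbors' public copies
```

The point of the sketch is that workers only ever transmit compressed differences between their private model and its public copy, and the gossip averaging operates on the public copies, so no full-precision model needs to be sent.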
Advances and Open Problems in Federated Learning
TLDR
Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Distributed stochastic gradient tracking methods
TLDR
It is shown that when the network is well-connected, GSGT incurs a lower communication cost than DSGT while maintaining a similar computational cost; both methods perform comparably to a centralized stochastic gradient algorithm.
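For reference, a standard form of the stochastic gradient-tracking update underlying methods of this type (notation assumed here; the paper's exact DSGT and GSGT formulations may differ in details) is
\[
x_i^{(t+1)} = \sum_{j=1}^{n} w_{ij}\bigl(x_j^{(t)} - \alpha\, y_j^{(t)}\bigr),
\qquad
y_i^{(t+1)} = \sum_{j=1}^{n} w_{ij}\, y_j^{(t)} + \nabla F_i\bigl(x_i^{(t+1)}, \xi_i^{(t+1)}\bigr) - \nabla F_i\bigl(x_i^{(t)}, \xi_i^{(t)}\bigr),
\]
where $W = [w_{ij}]$ is the mixing matrix, $\alpha$ the step size, and $y_i$ tracks the network-wide average stochastic gradient.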