Corpus ID: 220962352

PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning

By Thijs Vogels, Sai Praneeth Reddy Karimireddy, and Martin Jaggi
Lossy gradient compression has become a practical tool to overcome the communication bottleneck in centrally coordinated distributed training of machine learning models. However, algorithms for decentralized training with compressed communication over arbitrary connected networks have been more complicated, requiring additional memory and hyperparameters. We introduce a simple algorithm that directly compresses the model differences between neighboring workers using low-rank linear compressors…
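The core idea — approximating the difference between two neighboring workers' parameters with a rank-1 factorization obtained from a single power-iteration step, so that only two small vectors cross the network — can be sketched as follows. This is a minimal NumPy illustration with assumed shapes and a warm-started query vector, not the paper's exact algorithm:

```python
import numpy as np

def rank1_compress(diff, q):
    """One power-iteration step: approximate `diff` (m x n) by the outer
    product p q^T. Only the vectors p and q (m + n floats) would be
    communicated instead of the full m x n matrix."""
    p = diff @ q
    p = p / (np.linalg.norm(p) + 1e-12)   # normalize the left factor
    q_new = diff.T @ p                    # refreshed right factor
    return p, q_new

rng = np.random.default_rng(0)
x_i = rng.standard_normal((64, 32))               # worker i's parameter block
x_j = x_i + 0.01 * rng.standard_normal((64, 32))  # neighbor j, slightly off

diff = x_i - x_j
q = rng.standard_normal(32)      # query vector, warm-started across rounds
p, q = rank1_compress(diff, q)
approx = np.outer(p, q)          # rank-1 reconstruction of the difference

x_i_gossip = x_i - 0.5 * approx  # move partway toward the neighbor's model
```

Reusing `q` across rounds amortizes the power iteration over time, which is why a single matrix multiplication per round can suffice.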
1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
1-bit Adam is proposed, which reduces the communication volume by up to 5×, offers much better scalability, and provides the same sample-wise convergence speed as uncompressed Adam.
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
A new communication-efficient algorithm, 1-bit LAMB, is designed. It introduces a novel way to support adaptive layerwise learning rates even when communication is compressed, together with a new system implementation of compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance over the existing MPI-based implementation.
A Flexible Framework for Communication-Efficient Machine Learning
A flexible framework is proposed that adapts the compression level to the true gradient at each iteration, maximizing the improvement in the objective function achieved per communicated bit.
Consensus Control for Decentralized Deep Learning
It is shown in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart, and empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop.
Cross-Gradient Aggregation for Decentralized Learning from Non-IID data
This work proposes Cross-Gradient Aggregation (CGA), a novel decentralized learning algorithm in which each agent aggregates cross-gradient information and updates its model using a projected gradient based on quadratic programming (QP), and theoretically analyzes the convergence characteristics of CGA.
DataLens: Scalable Privacy Preserving Training via Gradient Compression and Aggregation
It is theoretically proved that the DataLens framework guarantees differential privacy for its generated data, an analysis of its convergence is provided, and it is shown that DataLens significantly outperforms other baseline DP generative models.
Distributed Online Learning for Joint Regret with Communication Constraints
A comparator-adaptive algorithm is provided for this setting, meaning that the joint regret scales with the norm of the comparator ‖u‖; the algorithm has worst-case optimal regret in the case that all agents communicate in every round.
ErrorCompensatedX: error compensation for variance reduced algorithms
ErrorCompensatedX is proposed, which uses the compression error from the previous two steps to achieve the same asymptotic convergence rate as training without compression; a unified theoretical analysis framework is provided for this class of variance-reduced algorithms, with or without error compensation.
Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data
This paper investigates and identifies the limitation of several decentralized optimization algorithms for different degrees of data heterogeneity, and proposes a novel momentum-based method to mitigate this decentralized training difficulty.
On Communication Compression for Distributed Optimization on Heterogeneous Data
The results indicate that D-EF-SGD is much less affected than D-QSGD by non-iid data, but both methods can suffer a slowdown if data-skewness is high.


DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
This work provides a detailed analysis of this two-pass communication model and its asynchronous parallel variant, with error-compensated compression on both the worker nodes and the parameter server. The method admits three very nice properties: it is compatible with an arbitrary compression technique, it achieves an improved convergence rate, and it attains linear speedup with respect to the number of workers.
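The error-compensation pattern shared by this line of work — adding the previous round's compression residual back in before compressing, on both the worker-to-server and server-to-worker passes — can be sketched generically. The compressor and function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ec_compress(value, error, compress):
    """Error-compensated compression: re-add the residual from the previous
    step before compressing, and carry the new residual forward."""
    corrected = value + error
    sent = compress(corrected)
    return sent, corrected - sent

def sign_compress(v):
    """Toy 1-bit-style compressor: sign times mean magnitude (illustrative)."""
    return np.sign(v) * np.mean(np.abs(v))

rng = np.random.default_rng(2)
worker_error = np.zeros(8)
server_error = np.zeros(8)

grad = rng.standard_normal(8)
up, worker_error = ec_compress(grad, worker_error, sign_compress)  # worker -> server
down, server_error = ec_compress(up, server_error, sign_compress)  # server -> workers
```

By construction, the transmitted value plus the retained residual always equals the corrected input, so no information is permanently discarded — it is merely deferred to later rounds.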
Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent
This paper studies a D-PSGD algorithm and provides the first theoretical analysis indicating a regime in which decentralized algorithms can outperform centralized algorithms for distributed stochastic gradient descent.
GradZip: Gradient compression using alternating matrix factorization for large-scale deep learning
  • 2019
Compressed Communication for Distributed Deep Learning: Survey and Quantitative Evaluation
A comprehensive survey of the most influential compressed-communication methods for DNN training is presented, together with an intuitive classification (i.e., quantization, sparsification, hybrid, and low-rank) and a unified framework and API that allow consistent and easy implementation of compressed communication on popular machine learning toolkits.
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
A new low-rank gradient compressor based on power iteration is proposed that can compress gradients rapidly, efficiently aggregate the compressed gradients using all-reduce, and achieve test performance on par with SGD.
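A single-worker round of this power-iteration scheme — rank-r factors refreshed from the error-compensated gradient, with QR orthonormalization for stability — can be sketched as follows. This is a local NumPy illustration under assumed shapes; in the distributed setting the small factor matrices would be all-reduced across workers rather than the full gradient:

```python
import numpy as np

def powersgd_round(grad, q, error):
    """One rank-r PowerSGD-style round for a single m x n gradient."""
    m = grad + error           # error feedback: re-add the last residual
    p = m @ q                  # (m, r) left factors
    p, _ = np.linalg.qr(p)     # orthonormalize the columns
    q = m.T @ p                # (n, r) right factors (reused next round)
    approx = p @ q.T           # decompressed low-rank gradient
    error = m - approx         # residual carried to the next round
    return approx, q, error

rng = np.random.default_rng(1)
grad = rng.standard_normal((128, 64))
q = rng.standard_normal((64, 4))   # rank 4: ~(128 + 64) * 4 floats sent, not 128 * 64
error = np.zeros_like(grad)
approx, q, error = powersgd_round(grad, q, error)
```

Because `p` and `q` are linear in the input, averaging the factors across workers with all-reduce is meaningful, which is what makes this compressor compatible with standard all-reduce pipelines.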
Decentralized Deep Learning with Arbitrary Communication Compression
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations, $\delta$ the eigengap of the connectivity matrix, and $\omega$ the compression ratio.
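The gossip mechanism behind this family of methods — each worker maintains a public compressed copy of its model and transmits only compressed differences, then takes a consensus step toward its neighbors' public copies — can be sketched roughly. Function names, uniform mixing weights, and the toy sparsifier below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def choco_gossip_step(x, x_hat, neighbor_hats, gamma, compress):
    """One CHOCO-style gossip step for a single worker (uniform mixing
    weights assumed). Only compress(x - x_hat) crosses the network."""
    q = compress(x - x_hat)    # compressed model difference (transmitted)
    x_hat = x_hat + q          # all nodes update this public copy identically
    avg = sum(h - x_hat for h in neighbor_hats) / len(neighbor_hats)
    x = x + gamma * avg        # consensus step toward the neighborhood average
    return x, x_hat

def top_half(v):
    """Toy sparsifier: keep only the largest half of the entries."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[v.size // 2:]
    out[keep] = v[keep]
    return out

x = np.array([1.0, 0.0])
x_hat = np.zeros(2)
x, x_hat = choco_gossip_step(x, x_hat, [np.array([0.0, 2.0])], 0.5, top_half)
```

Keeping the public copies `x_hat` synchronized on all nodes is what lets the algorithm tolerate arbitrary compression operators without extra hyperparameters beyond the consensus step size `gamma`.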
Advances and Open Problems in Federated Learning
Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Distributed stochastic gradient tracking methods
It is shown that when the network is well connected, GSGT incurs a lower communication cost than DSGT while maintaining a similar computational cost, achieving performance comparable to a centralized stochastic gradient algorithm.