Corpus ID: 246485766

Peering Beyond the Gradient Veil with Distributed Auto Differentiation

Bradley T. Baker, Aashis Khanal, Vince D. Calhoun, Barak A. Pearlmutter, S. Plis
Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs… 



Distributed Learning with Compressed Gradient Differences

This work proposes a new distributed learning method, DIANA, which mitigates the communication bottleneck by compressing gradient differences rather than the gradients themselves, and performs a theoretical analysis in the strongly convex and nonconvex settings showing that its rates are superior to existing rates.
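The core mechanism — compressing the *difference* between the current gradient and a slowly updated reference point, so that the transmitted signal (and hence the compression error) shrinks as training stabilizes — can be sketched as follows. This is a minimal illustration with a toy sign-based compressor standing in for DIANA's quantization operator; the function names and the choice of alpha are illustrative, not the paper's.

```python
import numpy as np

def compress(delta):
    """Toy 1-bit compressor: sign times mean magnitude.
    A stand-in for DIANA's quantization operator."""
    return np.sign(delta) * np.mean(np.abs(delta))

def diana_step(g, h, alpha=0.5):
    """One DIANA-style update: compress the gradient *difference* g - h,
    then move the local reference point h toward g."""
    c = compress(g - h)      # only c is communicated
    h_new = h + alpha * c    # reference drifts toward the true gradient
    g_hat = h + c            # receiver-side reconstruction of g
    return g_hat, h_new

# For a fixed gradient, the reference h tracks g, so the transmitted
# differences (and the compression error) shrink over successive rounds.
g = np.array([1.0, -2.0, 0.5])
h = np.zeros(3)
for _ in range(50):
    g_hat, h = diana_step(g, h)
```

Because the compressor only ever sees residuals, a crude 1-bit code suffices once the reference point has locked on.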

Stochastic Gradient Push for Distributed Deep Learning

Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus.
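SGP builds on push-sum gossip: each node carries a value and a scalar weight, pushes shares of both along directed edges, and de-biases by taking their ratio. A minimal sketch of the averaging component on a directed ring (topology and iteration count chosen for illustration):

```python
import numpy as np

def push_sum_average(values, iters=60):
    """Push-sum gossip on a directed ring: each node keeps a value x and a
    weight w, sends half of each to itself and half to its successor.
    The de-biased ratio x/w converges to the global average at every node."""
    n = len(values)
    x = np.array(values, dtype=float)
    w = np.ones(n)
    # Column-stochastic mixing matrix for the directed ring i -> (i+1) mod n.
    P = 0.5 * np.eye(n)
    for i in range(n):
        P[(i + 1) % n, i] = 0.5
    for _ in range(iters):
        x = P @ x
        w = P @ w
    return x / w  # every entry approaches mean(values)

est = push_sum_average([1.0, 2.0, 3.0, 4.0])
```

The weights w correct the bias that a merely column-stochastic (not doubly stochastic) mixing matrix would otherwise introduce — which is what lets SGP run over directed, time-varying graphs.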

Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication

To mitigate the limited communication bandwidth between contributing nodes and the prohibitive communication cost of distributed training, SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight-update encoding, pushing compression gains to new limits.
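The sparsification-plus-binarization step can be sketched as: keep only the k largest-magnitude entries, then replace their values with a single shared mean magnitude and their signs. This is a simplified illustration — the paper additionally applies communication delay and position encoding, which are omitted here, and the function name is ours.

```python
import numpy as np

def sparse_binary_compress(g, k):
    """Toy SBC-style compressor: keep the k largest-magnitude entries of g,
    then binarize them to one shared mean magnitude with their signs."""
    idx = np.argsort(np.abs(g))[-k:]   # indices of the top-k entries
    mu = np.mean(np.abs(g[idx]))       # single shared magnitude
    out = np.zeros_like(g)
    out[idx] = np.sign(g[idx]) * mu
    return out

g = np.array([0.1, -3.0, 0.05, 2.0, -0.2, 1.5])
c = sparse_binary_compress(g, k=3)
```

After this step, a surviving entry costs only its position plus one sign bit, since the magnitude is shared across the whole message.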

LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

A new class of gradient methods for distributed machine learning is presented that adaptively skips gradient calculations to learn with reduced communication and computation; this lazy aggregation justifies the acronym LAG.
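The skipping rule can be sketched as: a worker uploads a fresh gradient only when it has changed enough since the last upload, and the server otherwise reuses the stale copy. This is a simplified stand-in — LAG's actual trigger condition compares against a weighted history of parameter changes, whereas the threshold test below is illustrative.

```python
import numpy as np

def lag_worker_round(g, last_sent, threshold):
    """Simplified LAG-style rule: communicate a fresh gradient only when it
    differs enough from the last uploaded one; otherwise the server reuses
    the stale copy, saving one communication round."""
    if last_sent is None or np.linalg.norm(g - last_sent) > threshold:
        return g, True        # communicate fresh gradient
    return last_sent, False   # skip; stale gradient is reused

# A slowly drifting gradient triggers only occasional uploads.
rng = np.random.default_rng(0)
g = rng.normal(size=5)
last, uploads = None, 0
for t in range(100):
    g = g + 0.01 * rng.normal(size=5)   # small drift per round
    last, sent = lag_worker_round(g, last, threshold=0.1)
    uploads += sent
```

When gradients vary slowly, most rounds skip the upload entirely, which is where LAG's communication savings come from.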

signSGD: compressed optimisation for non-convex problems

SignSGD can get the best of both worlds: compressed gradients and an SGD-level convergence rate, and the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
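The server-side majority vote is simple enough to sketch directly: each worker transmits one bit per parameter (the gradient's sign), and the server steps in the direction of the element-wise majority. The learning rate, worker count, and toy quadratic objective below are illustrative choices.

```python
import numpy as np

def signsgd_step(x, grads, lr):
    """signSGD with majority vote: each worker sends only the sign of its
    gradient; the server applies the element-wise majority sign."""
    vote = np.sign(np.sum(np.sign(grads), axis=0))  # majority over workers
    return x - lr * vote

# Minimize f(x) = ||x||^2 with 3 workers seeing noisy gradients.
rng = np.random.default_rng(1)
x = np.array([2.0, -1.5])
for _ in range(300):
    grads = np.stack([2 * x + 0.1 * rng.normal(size=2) for _ in range(3)])
    x = signsgd_step(x, grads, lr=0.01)
```

Note the fixed step size: because only signs are used, the update magnitude is the same for every coordinate, so the iterate settles into a small noise ball around the optimum rather than converging exactly.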

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant, and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, which enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile devices.
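The heart of DGC is that the 99.9% of entries withheld each round are not discarded: they accumulate locally in a residual and are sent once they grow large. A minimal sketch of one such round, omitting DGC's momentum correction and warm-up tricks (the function name and k are ours):

```python
import numpy as np

def dgc_round(g, residual, k):
    """Simplified DGC-style round: accumulate the gradient into a residual,
    send only the top-k entries by magnitude, and keep the remainder
    locally so no gradient information is permanently dropped."""
    acc = residual + g
    idx = np.argsort(np.abs(acc))[-k:]   # top-k entries by magnitude
    sent = np.zeros_like(acc)
    sent[idx] = acc[idx]
    return sent, acc - sent              # transmitted part, new residual

g = np.array([0.3, -0.1, 2.0, 0.05, -0.7])
sent, residual = dgc_round(g, np.zeros(5), k=1)
```

Because transmitted and retained parts always sum back to the accumulated gradient, sparsification here is a delay, not a loss — which is why such aggressive ratios preserve accuracy.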

GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training

This paper empirically demonstrates the strong linear correlations between CNN gradients, and proposes a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction.
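The PCA idea can be illustrated on synthetic data: if gradient vectors are strongly linearly correlated, they lie near a low-dimensional subspace, so transmitting only the top-r principal coefficients loses little. A sketch under that assumption (the function name and synthetic setup are ours, not GradiVeQ's actual pipeline):

```python
import numpy as np

def pca_compress(G, r):
    """PCA-style gradient compression: project the rows of G onto the
    top-r principal directions and reconstruct; only the r coefficients
    per gradient would need to be communicated."""
    mean = G.mean(axis=0)
    U, S, Vt = np.linalg.svd(G - mean, full_matrices=False)
    basis = Vt[:r]                   # top-r principal directions
    coeffs = (G - mean) @ basis.T    # the compressed representation
    return coeffs @ basis + mean     # receiver-side reconstruction

# Synthetic near-low-rank "gradients": rank 2 plus small noise.
rng = np.random.default_rng(2)
G = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 50)) \
    + 0.01 * rng.normal(size=(100, 50))
G_hat = pca_compress(G, r=2)
```

The compression ratio is dimension/r per vector (50/2 = 25x here), at the cost of whatever energy lies outside the retained subspace.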

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
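Ternarization maps every gradient entry to one of three levels, s·{-1, 0, +1}, with the sign kept stochastically so the result is unbiased in expectation. A minimal sketch of the per-tensor version (layer-wise ternarizing and gradient clipping from the paper are omitted):

```python
import numpy as np

def ternarize(g, rng):
    """TernGrad-style stochastic ternarization: each entry becomes
    s * {-1, 0, +1} with s = max|g|, and the sign is kept with
    probability |g_i| / s, making E[ternarize(g)] = g."""
    s = np.max(np.abs(g))
    if s == 0:
        return np.zeros_like(g)
    p = np.abs(g) / s                   # probability of keeping the sign
    mask = rng.random(g.shape) < p
    return s * np.sign(g) * mask

rng = np.random.default_rng(3)
g = np.array([0.5, -1.0, 0.25, 0.0])
t = ternarize(g, rng)
```

With only three levels, each entry needs about 1.6 bits plus one shared scalar s per tensor; unbiasedness is what makes the convergence proof under the bounded-gradient assumption go through.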

Decentralized Deep Learning with Arbitrary Communication Compression

The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
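Downpour SGD's defining trait is asynchrony: workers compute gradients against parameter copies that may be several server updates stale, and the server applies updates as they arrive. A toy sequential simulation of that staleness on a simple quadratic (the function, objective, and staleness model are our illustrative choices, not the paper's system):

```python
import numpy as np

def downpour_sim(steps=100, lr=0.05, staleness=2):
    """Toy simulation of Downpour-style asynchronous SGD on f(x) = ||x||^2:
    each gradient is computed at a parameter copy up to `staleness`
    server updates old, then applied to the current parameters."""
    rng = np.random.default_rng(4)
    x = np.array([3.0, -2.0])
    history = [x.copy()]  # past server states a worker might have read
    for t in range(steps):
        delay = rng.integers(0, staleness + 1)
        stale_x = history[max(0, len(history) - 1 - delay)]
        g = 2 * stale_x + 0.05 * rng.normal(size=2)  # noisy grad of ||x||^2
        x = x - lr * g
        history.append(x.copy())
    return x

x_final = downpour_sim()
```

For small step sizes and modest staleness the iterates still contract toward the optimum, which is the empirical observation that made parameter-server asynchrony practical at scale.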