Stochastic Gradient Coding for Straggler Mitigation in Distributed Learning

@article{Bitar2020StochasticGC,
  title={Stochastic Gradient Coding for Straggler Mitigation in Distributed Learning},
  author={Rawad Bitar and Mary Wootters and Salim El Rouayheb},
  journal={IEEE Journal on Selected Areas in Information Theory},
  year={2020},
  volume={1},
  pages={277-291}
}
We consider distributed gradient descent in the presence of stragglers. Recent work on gradient coding and approximate gradient coding has shown how to add redundancy in distributed gradient descent to guarantee convergence even if some workers are stragglers, that is, slow or non-responsive. In this work we propose an approximate gradient coding scheme called Stochastic Gradient Coding (SGC), which works when the stragglers…
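
To make the setting concrete, here is a minimal toy simulation of SGC-style straggler-robust gradient descent on a least-squares problem. The replication and rescaling rule used below (each data point replicated to d random workers, received partial sums rescaled by 1/(d(1-p))) is an illustrative assumption in the spirit of the abstract, not the paper's exact construction; all names and parameters are hypothetical.

```python
# Toy sketch (not the paper's exact scheme): each data point is replicated to
# d random workers; each worker independently straggles with probability p;
# the master rescales the received partial sums so that, in expectation, the
# aggregate equals the full-batch gradient.
import numpy as np

rng = np.random.default_rng(0)
m, dim, n_workers, d, p = 200, 5, 10, 3, 0.3   # d = replication factor
X = rng.normal(size=(m, dim))
y = X @ rng.normal(size=dim) + 0.1 * rng.normal(size=m)

# Assign every data point to d distinct workers chosen uniformly at random.
assignment = [rng.choice(n_workers, size=d, replace=False) for _ in range(m)]
local_idx = [[i for i in range(m) if wk in assignment[i]]
             for wk in range(n_workers)]

w_est, lr = np.zeros(dim), 0.1
for step in range(200):
    responders = rng.random(n_workers) > p       # workers that did not straggle
    grad_sum = np.zeros(dim)
    for wk in range(n_workers):
        if responders[wk] and local_idx[wk]:
            Xi, yi = X[local_idx[wk]], y[local_idx[wk]]
            grad_sum += Xi.T @ (Xi @ w_est - yi) # sum of per-point gradients
    # Each point's gradient is received d*(1-p) times in expectation,
    # so this rescaling gives an unbiased estimate of the average gradient.
    w_est -= lr * grad_sum / (d * (1 - p) * m)
```

With this rescaling the update behaves like a batched stochastic gradient step, which is consistent with the convergence claim summarized in the TLDR below.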

Stochastic Gradient Coding for Flexible Straggler Mitigation in Distributed Learning

TLDR
It is proved that the convergence rate of SGC mirrors that of batched Stochastic Gradient Descent for the $l_{2}$ loss function, and it is shown how the convergence rate can improve with the redundancy.

Live Gradient Compensation for Evading Stragglers in Distributed Learning

TLDR
A Live Gradient Compensation (LGC) strategy to incorporate the one-step delayed gradients from stragglers, aiming to accelerate learning process and utilize the straggler nodes simultaneously is developed, and the numerical results demonstrate the effectiveness of the proposed strategy.

Optimal Communication-Computation Trade-Off in Heterogeneous Gradient Coding

TLDR
This paper characterizes the optimum communication cost for heterogeneous distributed systems with arbitrary data placement, and proposes an approximate gradient coding scheme for cases where the repetition in the data placement is smaller than what is needed to meet the restriction imposed on the communication cost.

Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning

TLDR
This paper designs a new gradient coding scheme for mitigating partial stragglers in distributed learning; it considers a distributed system consisting of one master and N workers under a general partial straggler model and focuses on solving a general large-scale machine learning problem with L model parameters using gradient coding.

Lightweight Projective Derivative Codes for Compressed Asynchronous Gradient Descent

TLDR
A novel algorithm is proposed that encodes the partial derivatives themselves and further optimizes the codes by performing lossy compression on the derivative codewords, maximizing the information contained in each codeword while minimizing the information shared between codewords.

Optimization-based Block Coordinate Gradient Coding

TLDR
This paper considers a distributed computation system consisting of one master and N workers characterized by a general partial straggler model and focuses on solving a general large-scale machine learning problem with $L$ model parameters; it obtains an optimal solution using a stochastic projected subgradient method and proposes two low-complexity approximate solutions with closed-form expressions for the stochastic optimization problem.

Approximate Gradient Coding With Optimal Decoding

TLDR
This work introduces novel approximate gradient codes based on expander graphs, in which each machine receives exactly two blocks of data points, and demonstrates empirically that these schemes achieve near-optimal error in the random setting and converge faster than algorithms which do not use the optimal decoding coefficients.
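
As an aside, the "optimal decoding coefficients" mentioned here can be illustrated with a generic least-squares decoder for approximate gradient coding. The sketch below assumes the common model in which worker i returns the sum of the block gradients indicated by row i of an assignment matrix A; it is an illustrative assumption, not the paper's exact algorithm.

```python
# Generic l2-optimal decoding sketch for approximate gradient coding, with
# two blocks per machine as in the expander-graph setting described above.
import numpy as np

rng = np.random.default_rng(2)
n_workers = n_blocks = 8
g = rng.normal(size=(n_blocks, 3))          # true per-block gradients (toy)
full = g.sum(axis=0)

# Assignment: each worker holds exactly two blocks and returns their sum.
A = np.zeros((n_workers, n_blocks))
for i in range(n_workers):
    A[i, rng.choice(n_blocks, size=2, replace=False)] = 1.0
messages = A @ g                            # what each worker would send

received = rng.choice(n_workers, size=5, replace=False)   # non-stragglers
A_r, msg_r = A[received], messages[received]

# Optimal (l2) decoding: pick coefficients c so that c^T A_r is as close as
# possible to the all-ones vector, i.e. the decoded combination approximates
# the sum of all block gradients.
c, *_ = np.linalg.lstsq(A_r.T, np.ones(n_blocks), rcond=None)
estimate = c @ msg_r
print(np.linalg.norm(estimate - full))      # approximation error
```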

Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning

TLDR
A novel paradigm of dynamic coded computation, called GC-DC, is introduced: redundant data is assigned to workers so that the master can dynamically choose from among a set of possible codes depending on past straggling behavior, and the number of stragglers in each cluster is regulated by dynamically forming the clusters at each iteration.

Approximate Gradient Coding for Heterogeneous Nodes

TLDR
This work introduces a heterogeneous straggler model where nodes are categorized into two classes, slow and active, and modifies existing gradient coding schemes by shuffling the training data among workers to better utilize the training data stored on slow nodes.

LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning

TLDR
A unified analysis of gradient coding, worker grouping, and adaptive worker selection techniques in terms of wall-clock time, communication, and computation complexity measures shows that G-LAG provides the best wall-clock time and communication performance while maintaining a low computational cost.

References

SHOWING 1-10 OF 66 REFERENCES

Distributed Stochastic Gradient Descent Using LDGM Codes

TLDR
In the proposed system, it may take longer than existing GC methods to recover the gradient information completely; however, it enables the master node to obtain a high-quality unbiased estimator of the gradient at low computational cost, which leads to an overall performance improvement.

Speeding Up Distributed Machine Learning Using Codes

TLDR
This paper focuses on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling, and uses codes to reduce communication bottlenecks, exploiting the excess in storage.

Gradient Coding Based on Block Designs for Mitigating Adversarial Stragglers

TLDR
This work proposes a class of approximate gradient codes based on balanced incomplete block designs (BIBDs), and shows that the approximation error for these codes depends only on the number of stragglers, and thus, adversarial straggler selection has no advantage over random selection.

Near-Optimal Straggler Mitigation for Distributed Gradient Methods

TLDR
This work proves that the proposed Batched Coupon's Collector (BCC) scheme is robust to a near-optimal number of random stragglers, and reduces the run-time by up to 85.4% on Amazon EC2 clusters when compared with other straggler mitigation strategies.

Improving Distributed Gradient Descent Using Reed-Solomon Codes

TLDR
This work adopts the framework of Tandon et al. and presents a deterministic scheme that, for a prescribed per-machine computational effort, recovers the gradient from the least number of machines theoretically permissible, via an $O(f^{2})$ decoding algorithm.

Gradient Coding From Cyclic MDS Codes and Expander Graphs

TLDR
This paper designs novel gradient codes using tools from classical coding theory, namely, cyclic MDS codes, which compare favorably with existing solutions, both in the applicable range of parameters and in the complexity of the involved algorithms.

Straggler Mitigation in Distributed Optimization Through Data Encoding

TLDR
This paper proposes several encoding schemes, and demonstrates that popular batch algorithms, such as gradient descent and L-BFGS, applied in a coding-oblivious manner, deterministically achieve sample path linear convergence to an approximate solution of the original problem, using an arbitrarily varying subset of the nodes at each iteration.

Fundamental Limits of Approximate Gradient Coding

TLDR
Two approximate gradient coding schemes based on a random edge removal process are proposed that exactly match such lower bounds; they provide an order-wise improvement over the state of the art in terms of computation load and are also optimal in both computation load and latency.

Gradient Coding: Avoiding Stragglers in Distributed Learning

We propose a novel coding theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to…
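
For context, a toy sketch of the replication idea behind exact gradient coding is given below. The grouping rule shown (groups of s+1 workers holding identical data, with any single responder per group sufficing) is a simplified, fractional-repetition-style illustration under assumed parameters, not the paper's exact construction.

```python
# Toy sketch: workers are split into groups of size s+1, every worker in a
# group holds the same data block and returns the same partial sum, so any
# one responder per group lets the master recover the exact full gradient
# despite up to s stragglers in total.
import numpy as np

rng = np.random.default_rng(1)
n_workers, s = 6, 2                    # tolerate up to s stragglers
group_size = s + 1
n_groups = n_workers // group_size     # assume group_size divides n_workers

m, dim = 120, 4
X = rng.normal(size=(m, dim))
y = X @ rng.normal(size=dim)
w = np.zeros(dim)

blocks = np.array_split(np.arange(m), n_groups)   # one data block per group

def partial_gradient(block, w):
    Xb, yb = X[block], y[block]
    return Xb.T @ (Xb @ w - yb)        # sum of per-point gradients on the block

# One synchronous step: at most s workers straggle.
stragglers = set(rng.choice(n_workers, size=s, replace=False))
grad = np.zeros(dim)
for g in range(n_groups):
    group = range(g * group_size, (g + 1) * group_size)
    responders = [wk for wk in group if wk not in stragglers]
    # All workers in a group would send the same value; any one responder works.
    assert responders, "at most s stragglers guarantees one responder per group"
    grad += partial_gradient(blocks[g], w)
w -= 0.01 * (grad / m)                 # exact full-batch gradient step
```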

ErasureHead: Distributed Gradient Descent without Delays Using Approximate Gradient Coding

TLDR
ErasureHead, a new approach for distributed gradient descent that mitigates system delays by employing approximate gradient coding, is presented and shown to lead to significant speedups over both standard and gradient-coded GD.
...