# Fault Tolerance in Iterative-Convergent Machine Learning

```bibtex
@article{Qiao2019FaultTI,
  title   = {Fault Tolerance in Iterative-Convergent Machine Learning},
  author  = {A. Qiao and Bryon Aragam and Bingjing Zhang and E. Xing},
  journal = {ArXiv},
  volume  = {abs/1810.07354},
  year    = {2019}
}
```

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness.
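The self-correcting behavior the abstract describes can be illustrated with a toy experiment (this is an illustrative sketch, not the paper's system or experimental setup): plain gradient descent on a simple quadratic objective, with an artificial calculation error injected into the gradient every few iterations. Because each update contracts toward the optimum, subsequent iterations absorb the perturbation and the run still converges.

```python
import random

def train(steps=200, lr=0.1, perturb_every=25, seed=0):
    """Gradient descent on f(w) = (w - 3)^2 with injected errors."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(steps):
        grad = 2.0 * (w - 3.0)          # exact gradient of (w - 3)^2
        if t > 0 and t % perturb_every == 0:
            grad += rng.uniform(-5, 5)  # simulated calculation error
        w -= lr * grad                  # later steps self-correct the error
    return w

# Despite periodic gradient corruption, the iterate ends near w* = 3.
print(train())
```

Each perturbation shifts the iterate by at most `lr * 5 = 0.5`, but the per-step contraction factor of `1 - 2*lr = 0.8` shrinks that deviation geometrically before the next error arrives, which is the intuition behind tolerating relaxed-consistency execution.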

## 8 Citations

- Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent (2020)
