Corpus ID: 52985688

Fault Tolerance in Iterative-Convergent Machine Learning

@article{Qiao2019FaultTI,
  title={Fault Tolerance in Iterative-Convergent Machine Learning},
  author={Aurick Qiao and Bryon Aragam and Bingjing Zhang and Eric P. Xing},
  journal={ArXiv},
  year={2019},
  volume={abs/1810.07354}
}
  • Aurick Qiao, Bryon Aragam, Bingjing Zhang, Eric P. Xing
  • Published 2019
  • Computer Science, Mathematics
  • ArXiv

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness…
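Below is a minimal, self-contained sketch (not taken from the paper) of the self-correcting behavior the abstract describes: plain SGD on a synthetic least-squares problem, with a simulated fault that wipes the parameter state partway through training. All names, constants, and the fault model are illustrative assumptions, not the paper's experimental setup.

```
import numpy as np

# Sketch: iterative-convergent training self-corrects a mid-training fault.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])                 # target parameters
X = rng.normal(size=(1000, 2))                 # synthetic features
y = X @ true_w + 0.1 * rng.normal(size=1000)   # noisy labels

w = np.zeros(2)                                # model parameters
lr = 0.05                                      # step size

for step in range(500):
    i = rng.integers(len(X))                   # sample one example
    grad = (X[i] @ w - y[i]) * X[i]            # stochastic gradient of squared error
    w -= lr * grad                             # SGD update

    if step == 250:                            # simulated fault (assumption): lose the state,
        w = np.zeros_like(w)                   # e.g., a worker restarts without a checkpoint

print("recovered parameters:", w)              # ends up close to true_w despite the fault
```

Running this prints parameters close to (2, -3) even though the iterate was zeroed halfway through training: later iterations absorb the perturbation, which is the kind of self-correction the paper studies for more general calculation errors.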
    Citations

  • MixML: A Unified Analysis of Weakly Consistent Parallel Learning
  • Algorithm-Based Fault Tolerance for Convolutional Neural Networks
  • On Efficient Constructions of Checkpoints
  • Robust Distributed Learning
