Distributed Fault Tolerant Linear System Solvers Based on Erasure Coding

Abstract

We present efficient coding schemes and distributed implementations of erasure coded linear system solvers. Erasure coded computations belong to the class of algorithmic fault tolerance schemes. They are based on augmenting an input dataset, executing the algorithm on the augmented dataset, and in the event of a fault, recovering the solution from the corresponding augmented solution. This process can be viewed as the computational analog of erasure coded storage schemes. The proposed technique has a number of important benefits: (i) as the hardware platform scales in size and number of faults, our scheme yields increasing improvement in resource utilization, compared to traditional schemes; (ii) the proposed scheme is easy to code – the core algorithms remain the same; and (iii) the general scheme is flexible – accommodating a range of computation and communication tradeoffs. We present new coding schemes for augmenting the input matrix that satisfy the recovery equations of erasure coding with high probability in the event of random failures. These coding schemes also minimize fill (non-zero elements introduced by the coding block), while being amenable to efficient partitioning across processing nodes. We demonstrate experimentally that our scheme adds minimal overhead for fault tolerance, yields excellent parallel efficiency and scalability, and is robust to different fault arrival models.

DOI: 10.1109/ICDCS.2017.261

9 Figures and Tables

Cite this paper

@article{Kang2017DistributedFT, title={Distributed Fault Tolerant Linear System Solvers Based on Erasure Coding}, author={Xuejiao Kang and David F. Gleich and Ahmed H. Sameh and Ananth Grama}, journal={2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)}, year={2017}, pages={2478-2485} }