FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery

  • K. Sato, A. Moody, K. Mohror, T. Gamblin, B. Supinski, N. Maruyama, S. Matsuoka
  • Published 2014
  • Computer Science
  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium
  • Future supercomputers built with more components will enable larger, higher-fidelity simulations, but at the cost of higher failure rates. Traditional approaches to mitigating failures, such as checkpoint/restart (C/R) to a parallel file system, incur large overheads. On future, extreme-scale systems, it is unlikely that traditional C/R will recover a failed application before the next failure occurs. To address this problem, we present the Fault Tolerant Messaging Interface (FMI), which enables…
    Local recovery and failure masking for stencil-based applications at extreme scales
    • 22
    Evaluating and extending user-level fault tolerance in MPI applications
    • 19
    EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications
    • 10
    Big Data Analytics in the Cloud: Spark on Hadoop vs MPI/OpenMP on Beowulf
    • 135
    A Resilient Framework for Iterative Linear Algebra Applications in X10
    • 7
    MAMS: A Highly Reliable Policy for Metadata Service
    • 3

