FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery

Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama, Satoshi Matsuoka
2014 IEEE 28th International Parallel and Distributed Processing Symposium
Future supercomputers built with more components will enable larger, higher-fidelity simulations, but at the cost of higher failure rates. Traditional approaches to mitigating failures, such as checkpoint/restart (C/R) to a parallel file system, incur large overheads. On future extreme-scale systems, it is unlikely that traditional C/R will recover a failed application before the next failure occurs. To address this problem, we present the Fault Tolerant Messaging Interface (FMI), which enables…