On the Survivability of Standard MPI Applications

Anand Tikotekar, Chokchai Leangsuksun, Stephen L. Scott
Job loss due to failure is a common vulnerability in High Performance Computing (HPC), especially in Message Passing Interface (MPI) environments. Rollback-recovery has been used to mitigate faults in long-running applications. However, rollback-recovery mechanisms such as checkpointing alone may not be sufficient to ensure fault…