Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery

@inproceedings{Bouteiller2015PlanBI,
  title={Plan B: Interruption of Ongoing MPI Operations to Support Failure Recovery},
  author={Aurelien Bouteiller and George Bosilca and Jack J. Dongarra},
  booktitle={EuroMPI},
  year={2015}
}
Advanced failure recovery strategies in HPC systems benefit tremendously from in-place failure recovery, in which the MPI infrastructure can survive process crashes and resume communication services. In this paper we present the rationale behind the specification, and an effective implementation, of the Revoke MPI operation. The purpose of the Revoke operation is to propagate failure knowledge and to interrupt ongoing, pending communication, under the control of the user. We explain…
