Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation

@article{Tang2006ProposalOM,
  title={Proposal of MPI Operation Level Checkpoint/Rollback and One Implementation},
  author={Yuan Tang and Graham E. Fagg and Jack J. Dongarra},
  journal={Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06)},
  year={2006},
  volume={1},
  pages={27--34}
}
With the increasing number of processors in modern HPC (High Performance Computing) systems, two problems have emerged: scalability and fault tolerance. In our previous work, we extended the MPI specification to handle fault tolerance by specifying a systematic framework for the recovery methods, communicator, message modes, etc. that define the behavior of MPI when an error occurs. These extensions not only specify how the implementation of the MPI library…

