CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

@inproceedings{Gupta2009CIFTSAC,
  title={CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems},
  author={Rinku Gupta and Peter H. Beckman and Byung-Hoon Park and Ewing L. Lusk and Paul Hargrove and Al Geist and Dhabaleswar K. Panda and Andrew Lumsdaine and Jack J. Dongarra},
  booktitle={2009 International Conference on Parallel Processing},
  year={2009},
  pages={237--245}
}
Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked in isolation, and information about faults is rarely shared among them. This lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software…
This paper has 73 citations.

