See applications run and throughput jump: The case for redundant computing in HPC

@inproceedings{Riesen2010SeeAR,
  title={See applications run and throughput jump: The case for redundant computing in {HPC}},
  author={Rolf Riesen and Kurt B. Ferreira and Jon Stearley},
  booktitle={2010 International Conference on Dependable Systems and Networks Workshops (DSN-W)},
  year={2010},
  pages={29--34}
}
For future parallel-computing systems with as few as twenty thousand nodes, we propose redundant computing to reduce the number of application interrupts. The frequency of faults in exascale systems will be so high that traditional checkpoint/restart methods will break down. Applications will experience interruptions so often that they will spend more time restarting and recovering lost work than computing the solution. We show that redundant computation at large scale can be cost effective and…
This paper has 18 citations.

Citations

Publications citing this paper.

Energy Efficient Fault Tolerance for High Performance Computing (HPC) in the Cloud

2013 IEEE Sixth International Conference on Cloud Computing • 2013

Design and implementation of a hardware checkpoint/restart core

IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012) • 2012

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

2012 IEEE 26th International Parallel and Distributed Processing Symposium • 2012

Reliability Analysis of Resilient Applications

Kathleen N. McGill, Stephen Taylor
2011

Evaluation of process level redundant checkpointing/restart for HPC systems

30th IEEE International Performance Computing and Communications Conference • 2011

References

Publications referenced by this paper.

CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems

2009 International Conference on Parallel Processing • 2009

Top 500 supercomputer site

H. Meuer, E. Strohmaier, H. Simon, J. Dongarra
http://www.top500.org/, Nov. • 2009

Modeling the Impact of Checkpoints on Next-Generation Systems

24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007) • 2007

A Large-Scale Study of Failures in High-Performance Computing Systems

IEEE Transactions on Dependable and Secure Computing • 2006

Performance implications of periodic checkpointing on large-scale cluster systems

19th IEEE International Parallel and Distributed Processing Symposium • 2005

Reliability Engineering Handbook, volume 2

D. B. Kececioglu
DEStech Publications, Inc, May • 2002

On Ramanujan's Q-function

Peter J. Grabner, Peter Kirschenhofer, Helmut Prodinger
1992
