Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies

Abstract

As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Efficiently running systems at such large scales critically relies on deploying effective, practical methods for fault tolerance while having a good understanding of their… (More)
DOI: 10.1109/TDSC.2016.2548463

15 Figures and Tables

Topics

  • Presentations referencing similar topics