A System Software Approach to Proactive Memory-Error Avoidance


Today's HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that…
DOI: 10.1109/SC.2014.63

19 Figures and Tables


