Anomaly detection in large-scale coalition clusters for dependability assurance

Abstract

In large-scale high-performance computing systems, component failures become norms instead of exceptions. Failure occurrence as well as its impact on system performance and operation costs are becoming an increasingly important concern to system designers and administrators. When a compute node fails to function properly, health-related data are valuable… (More)
DOI: 10.1109/HIPC.2010.5713169

6 Figures and Tables

Topics

  • Presentations referencing similar topics