Diagnosing Performance Variations in HPC Applications Using Machine Learning

@inproceedings{Tuncer2017DiagnosingPV,
  title={Diagnosing Performance Variations in HPC Applications Using Machine Learning},
  author={Ozan Tuncer and Emre Ates and Yijia Zhang and Ata Turk and Jim M. Brandt and Vitus J. Leung and Manuel Egele and Ayse Kivilcim Coskun},
  booktitle={ISC},
  year={2017}
}
With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. Application performance variation can be caused by resource contention as well as software- and firmware-related problems, and can lead to premature job termination, reduced performance, and wasted compute platform resources. To effectively alleviate this problem, system administrators must detect and…
This paper has 20 citations.

Citations

Publications citing this paper.

A Comprehensive Informative Metric for Analyzing HPC System Status Using the LogSCAN Platform

2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) • 2018

Reviewing Cloud Monitoring: Towards Cloud Resource Profiling

2018 IEEE 11th International Conference on Cloud Computing (CLOUD) • 2018

Tangram: Colocating HPC Applications with Oversubscription

2018 IEEE High Performance extreme Computing Conference (HPEC) • 2018

References

Publications referenced by this paper.

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis • 2014

Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example

2010 International Conference on Dependable Systems and Networks Workshops (DSN-W) • 2010

The NAS parallel benchmarks summary and preliminary results

Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91) • 1991

Identifying the Culprits Behind Network Congestion

2015 IEEE International Parallel and Distributed Processing Symposium • 2015

Toward Rapid Understanding of Production HPC Applications and Systems

2015 IEEE International Conference on Cluster Computing • 2015