MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

@inproceedings{farrell2021mlperfhpc,
  title={MLPerf{\texttrademark} HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems},
  author={Steven Andrew Farrell and Murali Krishna Emani and Jacob Balma and Lukas Drescher and Aleksandr Drozd and Andreas Fink and Geoffrey Fox and David Kanter and Thorsten Kurth and Peter Mattson and Dawei Mu and Amit Ruhela and Kento Sato and Koichi Shirahata and Tsuguchika Tabaru and Aristeidis Tsaris and Jan Balewski and Benjamin Cumming and Takumi Danjo and Jens Domke and Takaaki Fukai and Naoto Fukumoto and Tatsuya Fukushi and Balazs Gerofi and Takumi Honda and Toshiyuki Imamura and Akihiko Kasagi and Kentaro Kawakami and Shuhei Kudo and Akiyoshi Kuroda and Maxime Martinasso and Satoshi Matsuoka and Henrique Mendon\c{c}a and Kazuki Minami and Prabhat Ram and Takashi Sawada and Mallikarjun (Arjun) Shankar and Tom St. John and Akihiro Tabuchi and Venkatram Vishwanath and Mohamed Wahib and Masafumi Yamazaki and Junqi Yin},
  booktitle={2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)},
  year={2021}
}
  • Published 21 October 2021
Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven… 


HPC AI500: The Methodology, Tools, Roofline Performance Models, and Metrics for Benchmarking HPC AI Systems
Evaluations show the methodology, benchmarks, performance models, and metrics can measure, optimize, and rank the HPC AI systems in a scalable, simple, and affordable way.
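The Roofline performance models used by HPC AI500 build on a well-known bound: attainable throughput is limited by either peak compute or memory bandwidth times arithmetic intensity. A minimal generic sketch of that bound (illustrative only, not HPC AI500's exact model; the function name and example numbers are assumptions):

```python
# Generic Roofline estimate: attainable FLOP/s is the lesser of peak
# compute and memory bandwidth * arithmetic intensity (FLOP per byte).
def roofline_attainable_flops(peak_flops: float,
                              mem_bandwidth: float,
                              arithmetic_intensity: float) -> float:
    """Return the Roofline bound on achievable performance."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)

# Example: 100 TFLOP/s peak, 900 GB/s memory, 50 FLOP/byte intensity.
# 900e9 * 50 = 4.5e13 < 1e14, so this kernel is bandwidth-bound.
bound = roofline_attainable_flops(100e12, 900e9, 50)
print(bound)  # → 4.5e13
```

Kernels whose intensity puts them left of the "ridge point" (peak_flops / mem_bandwidth) are bandwidth-bound; those to the right are compute-bound.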
XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs
XSP is proposed — an across-stack profiling design that gives a holistic and hierarchical view of ML model execution that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead.
A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning
Deep500 is the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques and provides software infrastructure to utilize the most powerful supercomputers for extreme-scale workloads.
Benchmarking TPU, GPU, and CPU Platforms for Deep Learning
ParaDnn is introduced, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected, convolutional (CNN), and recurrent (RNN) neural networks, and the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms are quantified.
Fathom: reference workloads for modern deep learning methods
This paper assembles Fathom: a collection of eight archetypal deep learning workloads, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group, and focuses on understanding the fundamental performance characteristics of each model.
Exploiting Parallelism Opportunities with Deep Learning Frameworks
Across a diverse set of real-world deep learning models, the evaluation results show that the proposed performance tuning guidelines outperform the Intel and TensorFlow recommended settings by 1.30× and 1.38×, respectively.
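Tuning work of this kind typically adjusts threading knobs exposed through environment variables before the framework starts. A hedged sketch of such settings (the function, the chosen values, and the core count are illustrative assumptions, not the paper's exact guidelines; `KMP_*` variables apply to the Intel OpenMP runtime):

```python
import os

def apply_cpu_parallelism_settings(num_physical_cores: int) -> dict:
    """Export common CPU threading settings for deep learning training.

    Illustrative defaults only; optimal values are workload-dependent,
    which is exactly what tuning guidelines like the above search for.
    """
    settings = {
        # OpenMP worker threads, usually matched to physical cores
        "OMP_NUM_THREADS": str(num_physical_cores),
        # Intel OpenMP: how long threads spin-wait between parallel regions
        "KMP_BLOCKTIME": "1",
        # Pin threads to cores to avoid migration overhead
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }
    os.environ.update(settings)
    return settings

conf = apply_cpu_parallelism_settings(num_physical_cores=16)
print(conf["OMP_NUM_THREADS"])  # → 16
```

These variables must be set before the framework initializes its thread pools, which is why such tuning is usually done in the launch script rather than mid-training.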
24/7 Characterization of petascale I/O workloads
Darshan is demonstrated to characterize the I/O behavior of four scientific applications while inducing negligible overhead for I/O-intensive jobs with as many as 65,536 processes.
DAWNBench : An End-to-End Deep Learning Benchmark and Competition
DAWNBench is introduced, a benchmark and competition focused on end-to-end training time to achieve a state-of-the-art accuracy level, as well as inference with that accuracy, and will provide a useful, reproducible means of evaluating the many tradeoffs in deep learning systems.
An Extended Roofline Model with Communication-Awareness for Distributed-Memory HPC Systems
A simple and intuitive graphical model is proposed that extends the widely used Roofline performance model to include communication cost alongside memory access time and peak CPU performance, enabling performance evaluation along a third dimension: communication performance.
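The extension described above amounts to adding a network-bandwidth ceiling to the classic Roofline bound. A hedged sketch in the usual notation (the symbols are illustrative and not necessarily the paper's own):

```latex
% Classic Roofline: attainable performance P is bounded by peak compute
% P_peak and by memory bandwidth B_mem times arithmetic intensity I_mem.
P = \min\bigl(P_{\text{peak}},\; I_{\text{mem}} \cdot B_{\text{mem}}\bigr)

% Communication-aware extension: a third ceiling from network bandwidth
% B_net, where I_net is the FLOPs performed per byte communicated.
P = \min\bigl(P_{\text{peak}},\; I_{\text{mem}} \cdot B_{\text{mem}},\;
              I_{\text{net}} \cdot B_{\text{net}}\bigr)
```

Whichever term attains the minimum identifies the bottleneck: compute, memory, or the interconnect.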
Sarus: Highly Scalable Docker Containers for HPC Systems
Docker containers of HPC applications deployed with Sarus show two significant results: OCI hooks allow users and system administrators to transparently benefit from plugins that enable system-specific hardware; and the same level of performance and scalability as native execution is achieved.