The MIT Supercloud Dataset

@inproceedings{Samsi2021TheMS,
  title={The MIT Supercloud Dataset},
  author={Siddharth Samsi and Matthew L. Weiss and David Bestor and Baolin Li and Michael Jones and Albert Reuther and Daniel Edelman and William Arcand and Chansup Byun and John Holodnack and Matthew Hubbell and Jeremy Kepner and Anna Klein and Joseph McDonald and Adam Michaleas and Peter Michaleas and Lauren Milechin and Julia S. Mullen and Charles Yee and Benjamin Price and Andrew Prout and Antonio Rosa and Allan Vanterpool and Lindsey McEvoy and Anson Cheng and Devesh Tiwari and Vijay N. Gadepally},
  booktitle={2021 IEEE High Performance Extreme Computing Conference (HPEC)},
  year={2021},
  pages={1-8}
}
Artificial intelligence (AI) and machine learning (ML) workloads make up an increasingly large share of the compute workloads in traditional High-Performance Computing (HPC) centers and commercial cloud systems. This has led to changes in how HPC clusters and commercial clouds are deployed, as well as a new focus on optimizing resource usage and allocations, deploying new AI frameworks, and providing capabilities such as Jupyter notebooks to enable rapid prototyping and…
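The dataset described above is built from per-job monitoring time series collected on the cluster. As a minimal sketch of how such a trace might be summarized (the column names `JobID` and `GPUUtilization` and the inline sample rows are illustrative assumptions, not the dataset's documented schema), one could aggregate per-job utilization like this:

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Stand-in for one of the dataset's per-node time-series CSV files.
# The schema here is a hypothetical simplification for illustration.
rows = """JobID,Timestamp,GPUUtilization
1001,0,75.0
1001,60,85.0
1002,0,10.0
1002,60,30.0
"""

# Group utilization samples by job, then compute a per-job mean --
# a typical first step when characterizing workloads from traces.
per_job = defaultdict(list)
for row in csv.DictReader(io.StringIO(rows)):
    per_job[row["JobID"]].append(float(row["GPUUtilization"]))

means = {job: mean(vals) for job, vals in per_job.items()}
print(means)  # {'1001': 80.0, '1002': 20.0}
```

Summaries like these feed directly into the workload-classification and scheduling studies listed below.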

Citations

The MIT Supercloud Workload Classification Challenge
TLDR
A labelled dataset is introduced that can be used to develop new approaches to workload classification; initial results based on existing approaches are presented to foster algorithmic innovations in the analysis of compute workloads that achieve higher accuracy than existing methods.
Using Multi-Instance GPU for Efficient Operation of Multi-Tenant GPU Clusters
TLDR
MISO is proposed, a technique that exploits the Multi-Instance GPU (MIG) capability of NVIDIA A100 GPUs to dynamically partition GPU resources among co-located jobs, without incurring the overhead of implementing candidate partitions during exploration.
Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models
TLDR
This article investigates techniques that can be used to reduce the energy consumption of common NLP applications, and describes the impact of these techniques on metrics such as computational performance and energy consumption through experiments conducted on a high-performance computing system as well as popular cloud computing platforms.
Developing a Series of AI Challenges for the United States Department of the Air Force
Through a series of federal initiatives and orders, the U.S. Government has been making a concerted effort to ensure American leadership in AI. These broad strategy documents have influenced…
