Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing

@inproceedings{ward2021colmena,
  title={Colmena: Scalable Machine-Learning-Based Steering of Ensemble Simulations for High Performance Computing},
  author={Logan T. Ward and Ganesh Sivaraman and J. Gregory Pauloski and Yadu N. Babuji and Ryan Chard and Naveen K. Dandu and Paul C. Redfern and Rajeev S. Assary and Kyle Chard and Larry A. Curtiss and Rajeev Thakur and Ian T. Foster},
  booktitle={2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)},
  year={2021}
}
Scientific applications that involve simulation ensembles can be greatly accelerated by using experiment-design methods to select the best simulations to perform. Methods that use machine learning (ML) to create proxy models of simulations show particular promise for guiding ensembles but are challenging to deploy because of the need to coordinate dynamic mixes of simulation and learning tasks. We present Colmena, an open-source Python framework that allows users to steer campaigns by providing…
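The steering pattern the abstract describes can be sketched as an active-learning loop: a cheap ML proxy ranks candidate simulations, the best candidate is run, and the result feeds back into the proxy. This is a minimal illustration of the idea only, not Colmena's actual API; the toy `simulate` function and nearest-neighbor surrogate are invented for the example.

```python
import random

def simulate(x):
    """Stand-in for an expensive simulation we want to maximize."""
    return -(x - 0.7) ** 2

def surrogate_predict(history, x):
    """Toy proxy model: predict the value of the nearest evaluated point."""
    nearest = min(history, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

random.seed(0)
# Seed the campaign with a few initial simulations.
history = [(x, simulate(x)) for x in (0.0, 0.5, 1.0)]

# Steering loop: the surrogate scores many cheap candidates,
# and only the most promising one is actually simulated.
for _ in range(10):
    candidates = [random.random() for _ in range(50)]
    best = max(candidates, key=lambda x: surrogate_predict(history, x))
    history.append((best, simulate(best)))  # run the expensive task, update data

best_x, best_y = max(history, key=lambda p: p[1])
```

In a real campaign the simulation and (re)training tasks run concurrently on HPC resources, which is exactly the coordination problem frameworks like Colmena address.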


Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations
The results establish DeepDriveMD as a high-performance framework for ML-driven HPC simulation scenarios that supports diverse MD simulation and ML back-ends and enables new scientific insights by improving the length and time scales accessible with current computing capacity.


Proxima: accelerating the integration of machine learning in atomistic simulations
Proxima is proposed: a systematic, automated method for dynamically tuning a surrogate-modeling configuration in response to real-time feedback from the ongoing simulation. It respects a wide range of user-defined accuracy goals while achieving speedups of 1.02-5.5x relative to a standard approach.
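The control loop this summary describes can be illustrated with a simple fallback scheme: use a cheap surrogate while its measured error stays within a user tolerance, and otherwise fall back to the trusted calculation and refresh the error estimate. This is a hedged sketch of the general technique, not Proxima's implementation; all function names and the one-sample error estimate are invented for brevity.

```python
def reference(x):
    """Expensive, trusted calculation (e.g., an ab initio force evaluation)."""
    return x * x

def surrogate(x):
    """Cheap approximation with a small, known bias for demonstration."""
    return x * x + 0.01

tolerance = 0.05
recent_error = float("inf")  # start pessimistic: no evidence the surrogate is safe
reference_calls = 0
results = []

for x in [0.1 * i for i in range(20)]:
    if recent_error <= tolerance:
        results.append(surrogate(x))          # fast path: surrogate is trusted
    else:
        y = reference(x)                      # slow path: trusted calculation
        recent_error = abs(surrogate(x) - y)  # feedback: measure surrogate error
        reference_calls += 1
        results.append(y)
```

A production scheme would decay trust over time and retrain the surrogate from the reference results; this sketch only shows the accuracy-gated switching.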
Using Machine Learning at Scale in HPC Simulations with SmartSim: An Application to Ocean Climate Modeling
We demonstrate the first climate-scale numerical ocean simulations improved through distributed, online inference of deep neural networks (DNNs) using SmartSim, a library dedicated to integrating machine learning into HPC simulations.
DeepHyper: Asynchronous Hyperparameter Search for Deep Neural Networks
DeepHyper is presented, a Python package that provides a common interface for the implementation and study of scalable hyperparameter search methods and evaluates the efficacy of these methods relative to approaches such as random search, genetic algorithms, Bayesian optimization, and hyperband on DL benchmarks on CPU- and GPU-based HPC systems.
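The asynchronous evaluation that such search frameworks rely on can be sketched with the standard library alone: trials are submitted to a worker pool and collected as they complete, so one slow trial never blocks the rest. This is a generic random-search sketch, not DeepHyper's API; the `train_and_score` function and its mock scoring are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random

def train_and_score(config):
    """Stand-in for training a network; returns a mock validation score."""
    return -(config["lr"] - 0.01) ** 2

random.seed(1)
configs = [{"lr": random.uniform(0.001, 0.1)} for _ in range(16)]

# Asynchronous evaluation: results arrive in completion order,
# so the search can react without waiting for stragglers.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(train_and_score, c): c for c in configs}
    scored = [(futures[f], f.result()) for f in as_completed(futures)]

best_config, best_score = max(scored, key=lambda item: item[1])
```

Smarter strategies (Bayesian optimization, hyperband) replace the random sampling but keep the same asynchronous submit/collect skeleton.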
CANDLE/Supervisor: a workflow framework for machine learning applied to cancer research
This paper presents a workflow system that makes progress on scaling machine learning ensembles; specifically, this first release targets ensembles of deep neural networks that address problems in cancer research across the atomistic, molecular, and population scales.
Evaluating Optimization Strategies for Engine Simulations Using Machine Learning Emulators
  • D. Probst, M. Raju, Y. Pei
  • Computer Science
    Volume 2: Emissions Control Systems; Instrumentation, Controls, and Hybrids; Numerical Simulation; Engine Design and Mechanical Development
  • 2018
The best-performing optimization methods were particle swarm optimization (PSO), differential evolution (DE), GENOUD (an evolutionary algorithm), and a micro-genetic algorithm (GA), which found a high median optimum as well as a reasonable minimum optimum across the 100 trials.
Balsam: Near Real-Time Experimental Data Analysis on Supercomputers
This work describes how the Balsam edge service has enabled near-real-time analysis of data collected at the Advanced Photon Source with an X-ray Photon Correlation Spectroscopy application running on Argonne Leadership Computing Facility (ALCF) resources.
Rocketsled: a software library for optimizing high-throughput computational searches
Rocketsled is an open-source, Python-based software framework that helps users optimize arbitrary objective functions; it provides a practical framework for establishing complex optimization schemes with minimal code infrastructure and enables the efficient exploration of otherwise prohibitively large search spaces.
Scheduling many-task workloads on supercomputers: Dealing with trailing tasks
This paper proposes and tests two strategies to improve the trade-off between utilization and time to solution under the allocation policies of Blue Gene/P Intrepid at Argonne National Laboratory: scheduling tasks in order of longest to shortest, and downsizing allocations when utilization drops below some threshold.
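The longest-first strategy can be illustrated with a small greedy scheduler: sorting tasks by descending duration before assigning each to the least-loaded worker typically shortens the trailing tail and thus the makespan. This is a generic sketch of the heuristic, not the paper's scheduler; the task durations are invented.

```python
import heapq

def makespan(durations, workers):
    """Greedy list scheduling: each task goes to the least-loaded worker."""
    loads = [0.0] * workers  # min-heap of per-worker load
    heapq.heapify(loads)
    for d in durations:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + d)
    return max(loads)

tasks = [5, 1, 9, 2, 8, 3, 7, 4, 6, 1]
unsorted_span = makespan(tasks, 3)                       # arrival order
sorted_span = makespan(sorted(tasks, reverse=True), 3)   # longest first
```

With these durations the longest-first ordering finishes earlier (16 vs. 17 time units), because short tasks placed last fill in the gaps instead of leaving one worker with a long trailing task.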
Tune: A Research Platform for Distributed Model Selection and Training
Tune is proposed: a unified framework for model selection and training that provides a narrow-waist interface between training scripts and search algorithms, meets the requirements of a broad range of hyperparameter search algorithms, allows straightforward scaling of search to large clusters, and simplifies algorithm implementation.
Characterizing the Performance of Executing Many-tasks on Summit
The performance of executing many tasks using RADICAL-Pilot (RP) when interfaced with JSM and PRRTE on Summit is characterized. It is found that PRRTE scales better than JSM for >O(1000) tasks, that PRRTE overheads are negligible, and that PRRTE supports optimizations that lower the impact of overheads and enable resource utilization of 63% when executing O(16K) one-core tasks over 404 compute nodes.