Deep Reinforcement Agent for Scheduling in HPC
@article{Fan2021DeepRA, title={Deep Reinforcement Agent for Scheduling in HPC}, author={Yuping Fan and Zhiling Lan and J. Taylor Childers and Paul M. Rich and William E. Allcock and Michael E. Papka}, journal={2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)}, year={2021}, pages={807-816} }
The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics. …
15 Citations
MRSch: Multi-Resource Scheduling for HPC
- Computer Science, Business
- 2022 IEEE International Conference on Cluster Computing (CLUSTER)
- 2022
An intelligent scheduling agent named MRSch is presented for multi-resource scheduling in HPC; it leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm, to tackle the challenges involved in multi-resource scheduling.
Deep Reinforcement Agent for Failure-aware Job scheduling in High-Performance Computing
- Computer Science
- 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)
- 2021
This paper proposes a novel HPC scheduling agent named FARS (Failure-aware RL-based scheduler) that considers the effects of job failures; compared with the best baseline model, FARS obtains a 5.69% improvement in average makespan under different device error rates.
DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing
- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2022
This work presents an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning and demonstrates that DRAS outperforms the existing heuristic and optimization approaches by up to 50%.
SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning
- Computer Science, Business
- HPDC
- 2022
One key advantage of SchedInspector is that it automatically learns to work with and improve existing job scheduling policies without changing them, which makes it promising as a generic enhancer for various batch job scheduling policies.
Hybrid Workload Scheduling on HPC Systems
- Computer Science
- 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2022
This study presents several scheduling mechanisms to address the issues involved in co-scheduling on-demand, rigid, and malleable jobs on a single HPC system, and extensively evaluates and compares their performance under various configurations and workloads.
Not All Tasks Are Created Equal: Adaptive Resource Allocation for Heterogeneous Tasks in Dynamic Workflows
- Computer Science
- 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS)
- 2021
It is shown that tasks performing different work may have significantly different resource consumption and that exploiting the heterogeneity of tasks is a desirable way to reveal and predict the relationship between tasks and their resource consumption, reduce waste from resource misallocation, increase tasks' consumption efficiency, and incentivize users' cooperation.
RLSchert: An HPC Job Scheduler Using Deep Reinforcement Learning and Remaining Time Prediction
- Computer Science
- Applied Sciences
- 2021
RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM, and the dynamic predictor gives a more accurate remaining runtime prediction, which is essential for most learning-based schedulers.
Job Scheduling in High Performance Computing
- Computer Science, Business
- ArXiv
- 2021
This research study investigated the challenges faced by HPC scheduling and the state-of-the-art scheduling methods used to overcome them, and proposed an intelligent scheduling framework to alleviate the problems encountered in modern job scheduling.
On the impact of MDP design for Reinforcement Learning agents in Resource Management
- Computer Science
- BRACIS
- 2021
It is shown that, in the authors' experiments, when using Multi-Layer Perceptrons as the approximation function, a compact state representation allows transfer of agents between environments, and that transferred agents perform well and outperform specialized agents in 80% of the tested scenarios, even without retraining.
Application Checkpoint and Power Study on Large Scale Systems
- Computer Science, Engineering
- ArXiv
- 2021
This study analyzes the relation between application checkpoints and their power consumption, and the observations could guide the design of power management.
23 References
Learning scheduling algorithms for data processing clusters
- Computer Science, Business
- SIGCOMM
- 2019
It is shown that modern machine learning techniques can generate highly-efficient policies automatically and improve average job completion time by at least 21% over hand-tuned scheduling heuristics, achieving up to 2x improvement during periods of high cluster load.
Multi-Resource Packing for Cluster Schedulers
- Computer Science
- 2014
This work presents Tetris, a cluster scheduler that packs, i.e., matches multi-resource task requirements with the resource availabilities of machines, so as to increase cluster efficiency (makespan).
Resource Management with Deep Reinforcement Learning
- Computer Science, Business
- HotNets
- 2016
This work presents DeepRM, an example solution that translates the problem of packing tasks with multiple resource demands into a learning problem, and shows that it performs comparably to state-of-the-art heuristics, adapts to different conditions, converges quickly, and learns strategies that are sensible in hindsight.
Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling
- Computer Science
- IEEE Trans. Parallel Distributed Syst.
- 2001
The sensitivity of backfilling to the accuracy of the runtime estimates provided by the users is studied, and a very surprising result is found: backfilling actually works better when users overestimate the runtime by a substantial factor.
Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne
- Computer Science
- JSSPP
- 2017
The specific scheduling goals and constraints are described, the workload traces collected in 2013–2017 from the 48-rack petascale supercomputer Mira are analyzed, and the upcoming scheduling challenges at ALCF are discussed.
Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems
- Computer Science
- 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
- 2013
A job power aware scheduling mechanism is presented to reduce HPC's electricity bill without degrading the system utilization, and preliminary results show that the power aware algorithm can reduce the electricity bill of HPC systems by as much as 23%.
The Effect of System Utilization on Application Performance Variability
- Business, Computer Science
- Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers - ROSS '19
- 2019
This work-in-progress study investigates a scheduling policy to mitigate workload interference by leveraging the fact that production systems often exhibit diurnal utilization behavior and not all users are in a hurry for job completion.
Deep Reinforcement Learning framework for Autonomous Driving
- Computer Science
- ArXiv
- 2017
The proposed framework for autonomous driving using deep reinforcement learning incorporates Recurrent Neural Networks for information integration, enabling the car to handle partially observable scenarios, and integrates recent work on attention models to focus on relevant information, thereby reducing the computational complexity for deployment on embedded hardware.
Joint Effects of Application Communication Pattern, Job Placement and Network Routing on Fat-Tree Systems
- Computer Science
- ICPP Workshops
- 2018
Initial experimentation shows that the performance of HPC applications is not only related to the communication pattern but also depends on the job placement and network routing on fat-tree systems.
Preliminary Interference Study About Job Placement and Routing Algorithms in the Fat-Tree Topology for HPC Applications
- Computer Science
- 2017 IEEE International Conference on Cluster Computing (CLUSTER)
- 2017
Initial experimentation shows that the performance of HPC applications is not only related to their communication pattern but also depends on the job placement and network routing on fat-tree systems.