Deep Reinforcement Agent for Scheduling in HPC

  • Yuping Fan, Zhiling Lan, J. Taylor Childers, Paul M. Rich, William E. Allcock, Michael E. Papka
  • Published 11 February 2021
  • Computer Science
  • 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
The cluster scheduler is crucial in high-performance computing (HPC). It determines when and which user jobs should be allocated to available system resources. Existing cluster scheduling heuristics are developed by human experts based on their experience with specific HPC systems and workloads. However, the increasing complexity of computing systems and the highly dynamic nature of application workloads have placed a tremendous burden on manually designed and tuned scheduling heuristics.

MRSch: Multi-Resource Scheduling for HPC

  • Boyang Li, Yuping Fan, M. Papka
  • Computer Science, Business
    2022 IEEE International Conference on Cluster Computing (CLUSTER)
  • 2022
An intelligent scheduling agent named MRSch is presented for multi-resource scheduling in HPC. It leverages direct future prediction (DFP), an advanced multi-objective reinforcement learning algorithm, to tackle the challenges involved in multi-resource scheduling.

Deep Reinforcement Agent for Failure-aware Job scheduling in High-Performance Computing

This paper proposes a novel HPC scheduling agent named FARS (Failure-aware RL-based scheduler) that accounts for the effects of job failures. Compared with the best baseline model, FARS improves average makespan by 5.69% under different device error rates.

DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing

This work presents an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning and demonstrates that DRAS outperforms the existing heuristic and optimization approaches by up to 50%.

SchedInspector: A Batch Job Scheduling Inspector Using Reinforcement Learning

One key advantage of SchedInspector is that it automatically learns to work with and improve existing job scheduling policies without changing them, which makes it promising as a generic enhancer for various batch job scheduling policies.

Hybrid Workload Scheduling on HPC Systems

This study presents several scheduling mechanisms to address the issues involved in co-scheduling on-demand, rigid, and malleable jobs on a single HPC system, and extensively evaluates and compares their performance under various configurations and workloads.

Not All Tasks Are Created Equal: Adaptive Resource Allocation for Heterogeneous Tasks in Dynamic Workflows

It is shown that tasks performing different work may have significantly different resource consumption and that exploiting the heterogeneity of tasks is a desirable way to reveal and predict the relationship between tasks and their resource consumption, reduce waste from resource misallocation, increase tasks' consumption efficiency, and incentivize users' cooperation.

RLSchert: An HPC Job Scheduler Using Deep Reinforcement Learning and Remaining Time Prediction

RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM, and the dynamic predictor gives a more accurate remaining runtime prediction, which is essential for most learning-based schedulers.

Job Scheduling in High Performance Computing

This research study investigated the challenges faced by HPC scheduling and the state-of-the-art scheduling methods for overcoming them, and proposed an intelligent scheduling framework to alleviate the problems encountered in modern job scheduling.

On the impact of MDP design for Reinforcement Learning agents in Resource Management

It is shown that, in the authors' experiments, when using Multi-Layer Perceptrons as the approximation function, a compact state representation allows agents to be transferred between environments, and that transferred agents perform well and outperform specialized agents in 80% of the tested scenarios, even without retraining.

Application Checkpoint and Power Study on Large Scale Systems

This study analyzes the relationship between application checkpoints and their power consumption; the observations could guide the design of power management.

Learning scheduling algorithms for data processing clusters

It is shown that modern machine learning techniques can generate highly-efficient policies automatically and improve average job completion time by at least 21% over hand-tuned scheduling heuristics, achieving up to 2x improvement during periods of high cluster load.

Multi-Resource Packing for Cluster Schedulers

This work presents Tetris, a cluster scheduler that packs, i.e., matches multi-resource task requirements with resource availabilities of machines so as to increase cluster efficiency (makespan).
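The idea of matching a task's multi-resource requirements against each machine's available resources can be sketched in a few lines. The example below is a minimal illustration, not code from the Tetris work: the dot-product alignment score, the function names, and the two-resource (CPU, memory) vectors are all assumptions made for this sketch, and the values are assumed to be normalized to comparable units.

```python
from typing import Optional, Sequence


def fits(demand: Sequence[float], free: Sequence[float]) -> bool:
    """Return True if every resource demand is within the machine's free capacity."""
    return all(d <= f for d, f in zip(demand, free))


def alignment(demand: Sequence[float], free: Sequence[float]) -> float:
    """Dot product of demand and free capacity (assumed score; larger is a better match)."""
    return sum(d * f for d, f in zip(demand, free))


def pick_machine(
    demand: Sequence[float], machines: Sequence[Sequence[float]]
) -> Optional[int]:
    """Pick the index of the feasible machine with the highest alignment score.

    Returns None when the task fits on no machine.
    """
    best_index, best_score = None, float("-inf")
    for index, free in enumerate(machines):
        if not fits(demand, free):
            continue  # skip machines that lack any one of the required resources
        score = alignment(demand, free)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Hypothetical free capacities per machine: (CPU, memory).
machines = [[4, 1], [2, 8], [1, 1]]
print(pick_machine([2, 4], machines))  # memory-heavy task
print(pick_machine([3, 1], machines))  # CPU-heavy task
print(pick_machine([8, 8], machines))  # task too large for any machine
```

The feasibility check comes first so that a high score on one resource can never place a task on a machine that is short on another.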

Resource Management with Deep Reinforcement Learning

This work presents DeepRM, an example solution that translates the problem of packing tasks with multiple resource demands into a learning problem, and shows that it performs comparably to state-of-the-art heuristics, adapts to different conditions, converges quickly, and learns strategies that are sensible in hindsight.

Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling

The sensitivity of backfilling to the accuracy of the runtime estimates provided by users is studied, and a very surprising result is found: backfilling actually works better when users overestimate the runtime by a substantial factor.

Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne

The specific scheduling goals and constraints are described, the workload traces collected in 2013–2017 from the 48-rack petascale supercomputer Mira are analyzed, and the upcoming scheduling challenges at ALCF are discussed.

Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems

  • Xu Yang, Zhou Zhou, M. Papka
  • Computer Science
    2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • 2013
A job power-aware scheduling mechanism is proposed to reduce the electricity bill of HPC systems without degrading system utilization; preliminary results show that the power-aware algorithm can reduce the electricity bill by as much as 23%.

The Effect of System Utilization on Application Performance Variability

This work-in-progress study investigates a scheduling policy to mitigate workload interference by leveraging the fact that production systems often exhibit diurnal utilization behavior and not all users are in a hurry for job completion.

Deep Reinforcement Learning framework for Autonomous Driving

The proposed framework for autonomous driving using deep reinforcement learning incorporates Recurrent Neural Networks for information integration, enabling the car to handle partially observable scenarios. It also integrates recent work on attention models to focus on relevant information, thereby reducing the computational complexity for deployment on embedded hardware.

Joint Effects of Application Communication Pattern, Job Placement and Network Routing on Fat-Tree Systems

Initial experimentation shows that the performance of HPC applications is not only related to the communication pattern but also depends on the job placement and network routing on fat-tree systems.

Preliminary Interference Study About Job Placement and Routing Algorithms in the Fat-Tree Topology for HPC Applications

Initial experimentation shows that the performance of an HPC application is not only related to its communication pattern but also depends on the job placement and network routing on fat-tree systems.