Corpus ID: 221340551

Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

@article{Qiao2021PolluxCC,
  title={Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning},
  author={A. Qiao and W. Neiswanger and Qirong Ho and Hao Zhang and G. Ganger and E. Xing},
  journal={ArXiv},
  year={2021},
  volume={abs/2008.12260}
}
Pollux improves scheduling performance in deep learning (DL) clusters by adaptively co-optimizing inter-dependent factors both at the per-job level and at the cluster-wide level. Most existing schedulers assign each job the number of resources requested by the user, which can allow jobs to use those resources inefficiently. Some recent schedulers choose job resources for users, but do so without awareness of how DL training can be re-optimized to better utilize those resources. Pollux…
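Pollux's central quantity is "goodput", which folds statistical efficiency into raw system throughput. The sketch below illustrates that idea in Python; the cost model, the gradient-noise-scale style efficiency term, and every constant are assumptions for illustration, not Pollux's actual formulas.

```python
# A minimal sketch of the goodput idea:
#   goodput = system throughput (examples/sec) x statistical efficiency
#             (training progress per example)
# All functional forms and constants below are illustrative assumptions.

def throughput(num_gpus: int, per_gpu_batch: int,
               t_example: float = 0.001, t_overhead: float = 0.05) -> float:
    """Examples/sec under a toy cost model: per-step time = compute time
    (proportional to the per-GPU batch) + fixed sync/launch overhead."""
    step_time = per_gpu_batch * t_example + t_overhead
    return num_gpus * per_gpu_batch / step_time

def statistical_efficiency(total_batch: int, init_batch: int = 128,
                           noise_scale: float = 500.0) -> float:
    """Progress per example relative to the initial batch size, using a
    gradient-noise-scale style approximation (assumed functional form)."""
    return (noise_scale + init_batch) / (noise_scale + total_batch)

def goodput(num_gpus: int, per_gpu_batch: int) -> float:
    """Goodput = throughput x statistical efficiency."""
    total_batch = num_gpus * per_gpu_batch
    return throughput(num_gpus, per_gpu_batch) * statistical_efficiency(total_batch)

if __name__ == "__main__":
    # Co-adaptation in miniature: as a job is granted more GPUs, the per-GPU
    # batch size that maximizes goodput shrinks.
    for gpus in (1, 2, 4, 8):
        best_goodput, best_batch = max(
            (goodput(gpus, b), b) for b in (32, 64, 128, 256))
        print(f"{gpus} GPUs: best per-GPU batch {best_batch}, "
              f"goodput {best_goodput:.0f}")
```

Running the loop shows the co-adaptive effect the title refers to: under this toy model the goodput-maximizing per-GPU batch size falls as the allocation grows, which is why batch size and resource allocation need to be chosen together.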

Citations

BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes
TLDR
This work describes how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed-integer linear programming (MILP) resource-allocation problem, and shows that this MILP can be solved efficiently at run time.
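To make the MILP formulation above concrete, here is a toy resource-allocation model built with the PuLP solver; the node counts, job sizes, and job values are invented for illustration and are not BFTrainer's actual formulation.

```python
# Toy MILP: pack preemptible DNN training jobs onto currently idle nodes.
# Requires `pip install pulp`.
from pulp import LpBinary, LpMaximize, LpProblem, LpVariable, lpSum, PULP_CBC_CMD

free_nodes = ["n1", "n2", "n3"]           # currently idle nodes (assumed)
nodes_needed = {"jobA": 2, "jobB": 1}     # nodes each job needs (assumed)
value = {"jobA": 5.0, "jobB": 3.0}        # benefit of running each job (assumed)

prob = LpProblem("fill_idle_nodes", LpMaximize)
# x[j, n] = 1 if job j is placed on node n; run[j] = 1 if job j is scheduled.
x = {(j, n): LpVariable(f"x_{j}_{n}", cat=LpBinary)
     for j in nodes_needed for n in free_nodes}
run = {j: LpVariable(f"run_{j}", cat=LpBinary) for j in nodes_needed}

# Objective: maximize the total value of the jobs that get scheduled.
prob += lpSum(value[j] * run[j] for j in nodes_needed)
# Each idle node hosts at most one job.
for n in free_nodes:
    prob += lpSum(x[j, n] for j in nodes_needed) <= 1
# A job runs only if it receives exactly the number of nodes it needs.
for j, need in nodes_needed.items():
    prob += lpSum(x[j, n] for n in free_nodes) == need * run[j]

prob.solve(PULP_CBC_CMD(msg=False))
for j in nodes_needed:
    placed = [n for n in free_nodes if x[j, n].varValue == 1]
    print(j, "->", placed if placed else "not scheduled")
```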
Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
  • Qinghao Hu, Peng Sun, Shengen Yan, Yonggang Wen, Tianwei Zhang
  • Computer Science
  • 2021
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling…
Srift: Swift and Thrift Cloud-Based Distributed Training
TLDR
It is shown that Srift's choices of VM instances can lead to up to 2x better throughput and 1.6x lower cost per iteration compared to baseline choices across various DNN models in real-world scenarios, leveraging heterogeneous setups and spot instances.
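A toy illustration of the throughput-versus-cost trade-off described above: rank candidate VM configurations by cost per training iteration. The instance names, throughputs, and prices are made-up placeholders, not measurements from the paper.

```python
# Pick the VM configuration with the lowest cost per training iteration.
candidates = {  # name -> (iterations/sec, $/hour), all numbers assumed
    "8x spot V100 (one host)": (22.0, 9.8),
    "8x spot T4 (two hosts)":  (9.0, 3.2),
    "4x on-demand V100":       (12.0, 12.2),
}

def cost_per_iteration(throughput_it_s: float, price_per_hour: float) -> float:
    """Dollars spent per training iteration at the given throughput."""
    return price_per_hour / 3600.0 / throughput_it_s

for name, (tp, price) in sorted(candidates.items(),
                                key=lambda kv: cost_per_iteration(*kv[1])):
    print(f"{name}: {cost_per_iteration(tp, price):.6f} $/iteration")
```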
On the Future of Cloud Engineering
Ever since the commercial offerings of the Cloud started appearing in 2006, the landscape of cloud computing has been undergoing remarkable changes with the emergence of many different types of…
Simple and Automatic Distributed Machine Learning on Ray
  • Hao Zhang, Zhuohan Li, Lianmin Zheng, I. Stoica
  • Computer Science
  • KDD
  • 2021
In recent years, the pace of innovation in machine learning (ML) has accelerated; researchers in SysML have created algorithms and systems that parallelize ML training over multiple…

References

SHOWING 1-10 OF 90 REFERENCES
Optimus: an efficient dynamic resource scheduler for deep learning clusters
TLDR
Optimus is proposed, a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models, which accurately estimate training speed as a function of the resources allocated to each job.
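A sketch of the online resource-performance-model idea: fit a simple throughput curve to a handful of observed (workers, speed) points, then predict speed at unobserved allocations. The functional form and the data are assumptions for illustration, not Optimus's exact model.

```python
# Fit speed(w) = w / (a + b*w), a diminishing-returns curve, by least squares
# on the linearized form 1/speed = a*(1/w) + b.
import numpy as np

# Observed training speed (steps/sec) at a few worker counts (made-up data).
workers = np.array([1, 2, 4, 8], dtype=float)
speed = np.array([1.0, 1.8, 3.1, 4.6])

A = np.column_stack([1.0 / workers, np.ones_like(workers)])
(a, b), *_ = np.linalg.lstsq(A, 1.0 / speed, rcond=None)

def predict_speed(w: float) -> float:
    """Predicted steps/sec when the job is given w workers."""
    return w / (a + b * w)

for w in (3, 6, 16):
    print(f"predicted speed with {w} workers: {predict_speed(w):.2f} steps/sec")
```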
SLAQ: quality-driven scheduling for distributed machine learning
TLDR
SLAQ is described, a cluster scheduling system for approximate ML training jobs that aims to maximize overall job quality; it leverages the iterative nature of ML training algorithms by collecting quality and resource-usage information from concurrent jobs and generating highly tailored quality-improvement predictions for future iterations.
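A sketch of the quality-driven scheduling loop under an assumed convergence model: fit each job's recent loss history, predict its near-term loss reduction, and give the next scheduling quantum to the job with the largest predicted improvement. The 1/k loss form and the histories are illustrative only.

```python
import numpy as np

def predict_next_drop(iters, losses):
    """Fit loss(k) ~ b + a/k and predict the loss drop over the next iteration
    (an assumed convergence model for illustration)."""
    k = np.asarray(iters, dtype=float)
    A = np.column_stack([1.0 / k, np.ones_like(k)])
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(losses, dtype=float), rcond=None)
    return (a / k[-1] + b) - (a / (k[-1] + 1) + b)

jobs = {  # recent (iteration, loss) history per job (made-up data)
    "jobA": ([10, 11, 12, 13], [0.90, 0.80, 0.73, 0.68]),
    "jobB": ([50, 51, 52, 53], [0.31, 0.309, 0.308, 0.307]),
}
gains = {name: predict_next_drop(*hist) for name, hist in jobs.items()}
print("predicted loss drops:", {k: round(v, 4) for k, v in gains.items()})
print("next quantum goes to:", max(gains, key=gains.get))  # jobA, still improving
```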
Gandiva: Introspective Cluster Scheduling for Deep Learning
TLDR
Gandiva is introduced, a new cluster scheduling framework that utilizes domain-specific knowledge to improve the latency and efficiency of training deep learning models in a GPU cluster, and that improves utilization by transparently migrating and time-slicing jobs for a better job-to-resource fit.
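A minimal sketch of time-slicing when jobs outnumber GPUs, one of the mechanisms mentioned above. The round-robin loop below is a stand-alone simulation, not Gandiva's framework-level suspend/resume API.

```python
from collections import deque

def time_slice(jobs, num_gpus, rounds):
    """Round-robin jobs over num_gpus GPUs; each active job runs one quantum,
    then is suspended so waiting jobs get a turn."""
    queue = deque(jobs)
    schedule = []
    for r in range(rounds):
        active = [queue.popleft() for _ in range(min(num_gpus, len(queue)))]
        schedule.append((r, list(active)))   # these jobs run this round
        queue.extend(active)                 # suspended until their next turn
    return schedule

for rnd, active in time_slice(["j1", "j2", "j3", "j4", "j5"],
                              num_gpus=2, rounds=4):
    print(f"round {rnd}: running {active}")
```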
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
TLDR
Gavel is proposed, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies, allowing a heterogeneous cluster to sustain higher input load and improving end objectives such as average job completion time and makespan by up to 3.5x compared to heterogeneity-agnostic policies.
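A small numeric sketch of the abstraction such heterogeneity-aware policies optimize over: an allocation matrix of time fractions per accelerator type and the resulting effective throughput per job. All numbers below are made up.

```python
import numpy as np

gpu_types = ["V100", "P100", "K80"]
# throughput[j, g]: steps/sec of job j on one GPU of type g (assumed)
throughput = np.array([
    [4.0, 2.0, 1.0],   # job0: benefits a lot from newer GPUs
    [1.2, 1.1, 1.0],   # job1: mostly input-bound, nearly type-agnostic
])
# allocation[j, g]: fraction of time job j spends on type g (rows sum to <= 1)
allocation = np.array([
    [0.6, 0.0, 0.0],
    [0.0, 0.5, 0.5],
])
# Effective throughput = throughput-weighted sum of time fractions per job;
# a policy becomes an optimization problem over the allocation matrix.
effective = (throughput * allocation).sum(axis=1)
for j, eff in enumerate(effective):
    print(f"job{j}: effective throughput {eff:.2f} steps/sec")
```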
Tiresias: A GPU Cluster Manager for Distributed Deep Learning
TLDR
This work presents Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCTs), and proposes two scheduling algorithms that aim to minimize the average JCT.
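A sketch of a least-attained-service priority rule in the spirit of the algorithms above: rank jobs by accumulated GPU-time so newer or shorter jobs run first. Tiresias's discretized queues and Gittins-index variant are omitted; the job state is made up.

```python
jobs = [  # (name, gpus_held, seconds_run_so_far), assumed state
    ("jobA", 8, 3600),
    ("jobB", 2, 600),
    ("jobC", 4, 120),
]

def attained_service(job):
    """Attained service = GPUs held x time run (GPU-seconds)."""
    _, gpus, seconds = job
    return gpus * seconds

# Lower attained service -> higher priority.
for name, gpus, secs in sorted(jobs, key=attained_service):
    print(f"{name}: attained GPU-seconds = {gpus * secs}")
# jobC is scheduled first (480), then jobB (1200), then jobA (28800).
```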
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
TLDR
AntMan is presented, a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks and has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs.
Themis: Fair and Efficient GPU Cluster Scheduling
TLDR
Themis is a new scheduling framework for ML training workloads that uses a two-level scheduling architecture in which ML workloads bid on available resources offered in an auction run by a central arbiter, capturing placement sensitivity and ensuring efficiency.
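A sketch of the finish-time fairness metric Themis optimizes (rho = finish time in the shared cluster divided by finish time on a dedicated 1/N share); the auction-and-arbiter mechanism is omitted, and the numbers are invented for illustration.

```python
jobs = {  # name -> (expected finish time sharing the cluster,
          #          finish time on a dedicated 1/N share), hours, assumed
    "jobA": (10.0, 4.0),
    "jobB": (6.0, 5.0),
    "jobC": (9.0, 8.0),
}

# rho > 1 means the job is worse off than with an equal dedicated share.
rho = {name: shared / ideal for name, (shared, ideal) in jobs.items()}
worst = max(rho, key=rho.get)
print("finish-time fairness rho:", {k: round(v, 2) for k, v in rho.items()})
print("offer the next free GPUs to:", worst)   # jobA, rho = 2.5
```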
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
TLDR
A detailed workload characterization of a two-month long trace from a multi-tenant GPU cluster in a large enterprise is presented and design guidelines pertaining to next-generation cluster schedulers for DNN training workloads are provided.
KungFu: Making Training in Distributed Machine Learning Adaptive
TLDR
KungFu is described, a distributed ML library for TensorFlow that is designed to enable adaptive training and allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training.
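A generic sketch of an adaptation policy: monitor a training signal and adjust a hyper-parameter on the fly. This is plain Python for illustration, not KungFu's TensorFlow API, and the gradient-noise-scale trigger is an assumed policy.

```python
def adapt_batch_size(batch_size: int, noise_scale: float,
                     max_batch: int = 4096) -> int:
    """Grow the global batch size when the monitored gradient noise scale
    suggests larger batches would still be statistically efficient
    (threshold of 4x is an assumed policy parameter)."""
    if noise_scale > 4 * batch_size and batch_size < max_batch:
        return min(batch_size * 2, max_batch)
    return batch_size

batch = 128
# Monitored gradient noise scale over a few evaluation points (made up).
for step, noise in enumerate([300, 700, 1500, 2000, 2100]):
    new_batch = adapt_batch_size(batch, noise)
    if new_batch != batch:
        print(f"step {step}: noise scale {noise} -> batch {batch} -> {new_batch}")
        batch = new_batch
```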
Anytime Minibatch: Exploiting Stragglers in Online Distributed Optimization
TLDR
Anytime Minibatch prevents stragglers from holding up the system without wasting the work that stragglers can complete, and is up to 1.5 times faster in Amazon EC2 and up to five times faster when there is greater variability in compute node performance.
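A small simulation of the fixed-compute-window idea: every worker contributes whatever samples it finished within the window, so a straggler neither stalls the round nor has its partial work discarded. All timings and gradients below are synthetic.

```python
import random

def run_round(num_workers: int, compute_window_s: float = 1.0,
              time_per_sample_s: float = 0.05):
    """One aggregation round: each worker computes for a fixed time window and
    contributes however many per-sample gradients it finished."""
    total_samples, grad_sum = 0, 0.0
    for _ in range(num_workers):
        speed = random.uniform(0.5, 1.5)   # heterogeneous worker speed (assumed)
        n = int(compute_window_s * speed / time_per_sample_s)
        grads = [random.gauss(0.1, 1.0) for _ in range(n)]  # toy "gradients"
        total_samples += n
        grad_sum += sum(grads)
    # Average over however many samples were actually computed this round.
    return grad_sum / max(total_samples, 1), total_samples

random.seed(0)
avg_grad, n = run_round(num_workers=4)
print(f"aggregated {n} samples this round; averaged gradient {avg_grad:.3f}")
```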