Corpus ID: 246867134

Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav S. Gulavani, Rimma V. Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, S. Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Chengda Xu, Eddie Ailijiang, Suresh Krishnappa, Mark Russinovich
Lowering costs by driving high utilization across deep learning workloads is a crucial lever for cloud providers. We present Singularity, Microsoft's globally distributed scheduling service for highly-efficient and reliable execution of deep learning training and inference workloads. At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically scale deep learning workloads to drive high utilization without impacting their correctness or… 

Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs

Lucid is a non-intrusive deep learning workload scheduler based on interpretable models. It reduces the average job completion time, provides explicit system interpretations, and scales well enough for practical deployment.

Bottleneck Structure Graphs in ALTO: Use Cases and Requirements

Describes the ALTO new transport, which provides the transport functions of ALTO/SSE on top of HTTP/2 for more efficient ALTO transport, and relaxes a constraint that the ALTO specification imposed on allowed cost mode values.

Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

Gavel is a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Its policies allow a heterogeneous cluster to sustain higher input load and improve end objectives such as average job completion time and makespan by up to 3.5x compared to heterogeneity-agnostic policies.

Themis: Fair and Efficient GPU Cluster Scheduling

Themis is a new scheduling framework for ML training workloads that uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter to capture placement sensitivity and ensure efficiency.

AntMan: Dynamic Scaling on GPU Clusters for Deep Learning

AntMan is a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks. It has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs.

Tiresias: A GPU Cluster Manager for Distributed Deep Learning

This work presents Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which schedules and places DL jobs to reduce their job completion times (JCT), and proposes two scheduling algorithms that aim to minimize the average JCT.

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Optimus is a customized job scheduler for deep learning clusters that minimizes job training time based on online resource-performance models. These models estimate each job's training speed as a function of its allocated resources.

Gandiva: Introspective Cluster Scheduling for Deep Learning

Gandiva is a new cluster scheduling framework that uses domain-specific knowledge to improve the latency and efficiency of training deep learning models in a GPU cluster. It achieves better utilization by transparently migrating and time-slicing jobs to improve job-to-resource fit.

Elastic Resource Sharing for Distributed Deep Learning

Proposes Apathetic Future Share, a DLT system framework that transparently handles automatic job parallelization and efficiently performs frequent share re-adjustments, and that outperforms Themis, SRTF, and Tiresias-L in terms of average JCT by up to 2.2x.

DMTCP: Transparent checkpointing for cluster computations and the desktop

Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster, and DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.

Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning

Gandivafair is the first scheduler that allocates cluster-wide GPU time fairly among active users. It achieves efficiency and fairness despite cluster heterogeneity, and transparently incentivizes users to use older GPUs.

MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

The runtime overhead is found to be insignificant both for checkpoint-restart within a single host and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster.