Corpus ID: 195776537

Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

@article{Mahajan2019ThemisFA,
  title={Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads},
  author={K. Mahajan and Arjun Singhvi and A. Balasubramanian and Varun Batra and Surya Teja Chavali and S. Venkataraman and A. Akella and Amar Phanishayee and Shuchi Chawla},
  journal={ArXiv},
  year={2019},
  volume={abs/1907.01484}
}
  • Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, significant contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide instantaneous fair share of resources are a poor fit because of ML workloads' unique attributes. ML jobs are…
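    The truncated abstract contrasts instantaneous fair-share disciplines with the fairness notion the paper develops for long-running ML jobs. As a rough, non-authoritative illustration only (the metric definition, job names, workloads, and the linear-scaling assumption below are mine, not quoted from this record), here is a minimal Python sketch of a finish-time-fairness style ratio, i.e. how late a job finishes under a given allocation relative to running on an exclusive 1/N slice of the cluster:

    ```python
    # Hypothetical sketch: compare GPU allocations by a finish-time ratio,
    # rho = finish time under the given allocation / finish time on a 1/N exclusive slice.
    # Assumes training throughput scales linearly with GPUs, which real ML jobs often violate.

    def finish_time_fairness(gpu_hours_left: float, gpus_given: float,
                             total_gpus: int, num_jobs: int) -> float:
        """rho <= 1 means the job finishes no later than it would on its 1/N share."""
        t_given = gpu_hours_left / gpus_given
        t_fair_slice = gpu_hours_left / (total_gpus / num_jobs)
        return t_given / t_fair_slice

    TOTAL_GPUS, NUM_JOBS = 16, 2
    jobs = {"long_job": 640.0, "short_job": 80.0}  # remaining GPU-hours (made-up numbers)

    # Instantaneous fair share: each job gets 8 GPUs right now, so rho is 1.0 for both.
    for name, work in jobs.items():
        print(name, "equal split rho =", finish_time_fairness(work, 8, TOTAL_GPUS, NUM_JOBS))

    # A skewed allocation (12 vs. 4 GPUs): the short job's rho rises above 1,
    # i.e. it finishes later than on its fair slice, which a finish-time-based
    # objective would penalize even if the split looked efficient at this instant.
    for name, gpus in (("long_job", 12), ("short_job", 4)):
        print(name, "skewed split rho =", finish_time_fairness(jobs[name], gpus, TOTAL_GPUS, NUM_JOBS))
    ```

    In this toy run the skewed split gives the long job rho ≈ 0.67 and the short job rho = 2.0, illustrating why evaluating allocations by eventual finish times can diverge from instantaneous fair share.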
    4 Citations

    References

    Showing 1-10 of 23 references
    • Tiresias: A GPU Cluster Manager for Distributed Deep Learning (57 citations, highly influential)
    • Quincy: fair scheduling for distributed computing clusters (839 citations)
    • Multi-resource packing for cluster schedulers (291 citations)
    • Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (1,417 citations)
    • SLAQ: quality-driven scheduling for distributed machine learning (59 citations, highly influential)
    • Optimus: an efficient dynamic resource scheduler for deep learning clusters (103 citations)
    • Gandiva: Introspective Cluster Scheduling for Deep Learning (102 citations)
    • GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters (103 citations)
    • Sparrow: distributed, low latency scheduling (496 citations)