Pathways: Asynchronous Distributed Dataflow for ML

Paul Barham, Aakanksha Chowdhery, Jeffrey Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, Yonghui Wu
We present the design of a new large-scale orchestration layer for accelerators. Our system, PATHWAYS, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state-of-the-art performance for current models. PATHWAYS uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their…
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Alpa automates model-parallel training of large deep learning models by generating execution plans that unify data, operator, and pipeline parallelism, and it generalizes to models with heterogeneous architectures and to models without manually designed plans.
Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
JellyBean, a framework for serving and optimizing machine learning inference workflows on heterogeneous infrastructures, reduces the total serving cost of visual question answering and of vehicle tracking from the NVIDIA AI City Challenge by up to 36% and outperforms prior ML serving systems by up to 5x in serving costs.
PaLM: Scaling Language Modeling with Pathways
PaLM, a 540-billion-parameter, densely activated Transformer language model, achieves breakthrough performance, outperforming the state of the art on a suite of multi-step reasoning tasks and outperforming average human performance on the recently released BIG-bench benchmark.
Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters
Nebula-I is a general framework for collaboratively training deep learning models over remote heterogeneous clusters connected by low-bandwidth wide area networks (WANs). It is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware.
nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models
The proposed nuQmm reduces not only the latency of each GPU but also the latency of the entire inference of large LMs, because the high compression ratio achieved by low-bit quantization lowers the minimum number of GPUs required.
CoCa: Contrastive Captioners are Image-Text Foundation Models
A minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM.
PANGUBOT: Efficient Generative Dialogue Pre-training from Pre-trained Language Model
PANGU-BOT’s response quality, knowledge correctness, and safety are still far from perfect, and further explorations are indispensable to building reliable and smart dialogue systems.


Naiad: a timely dataflow system
It is shown that many powerful high-level programming models can be built on Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining.
The multikernel: a new OS architecture for scalable multicore systems
This work investigates a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GPipe is a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency
MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications, is proposed and evaluations show that MASK restores much of the throughput lost to TLB contention.
PipeMare: Asynchronous Pipeline Parallel DNN Training
This paper derives a simple but robust training method, called PipeMare, that tolerates asynchronous updates during pipeline-parallel execution and is the first to explore these techniques and fine-grained pipeline parallelism during neural network training.
PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications
The key idea is to leverage the layered structure of neural network models and their layer-by-layer computation pattern to pipeline model transmission over the PCIe and task execution in the GPU with model-aware grouping to improve GPU utilization.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. It is demonstrated that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
AvA: Accelerated Virtualization of Accelerators
AvA provides near-native performance and can enforce sharing policies that are not possible with current techniques, with orders of magnitude less developer effort than required for hand-built virtualization support.
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
Gavel is a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies, allowing a heterogeneous cluster to sustain higher input load and improving end objectives such as average job completion time and makespan by up to 3.5x compared to heterogeneity-agnostic policies.
PipeDream: generalized pipeline parallelism for DNN training
PipeDream is a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.