• Corpus ID: 234357958

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Authors: Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake A. Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam M. Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, Zhifeng Chen

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models. GSPMD infers the partitioning… 

DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution

This work proposes DistIR, an expressive intermediate representation for distributed DNN computation that is tailored for efficient analyses such as simulation, which enables automatically identifying the top-performing strategies without having to execute on physical hardware.

Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training

Amazon SageMaker model parallelism is presented, a software library that integrates with PyTorch and enables easy training of large models using model parallelism and other memory-saving features; its performance is evaluated on GPT-3, RoBERTa, BERT, and neural collaborative filtering.

Whale: Efficient Giant Model Training over Heterogeneous GPUs

Whale, a general and efficient distributed training framework for giant models, generalizes the programming interface by defining two new primitives in the form of model annotations, which allow user hints to be incorporated, and introduces a novel hardware-aware parallel strategy that improves the performance of model training on heterogeneous GPUs in a balanced manner.

Scaling Up Models and Data with t5x and seqio

Two software libraries are presented: t5x simplifies the process of building and training large language models at scale while maintaining ease of use, and seqio provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines.

Decentralized Training of Foundation Models in Heterogeneous Environments

This paper presents the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network, and provides a formal cost model and an efficient evolutionary algorithm to find the optimal allocation strategy.

TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting

TSPLIT is a fine-grained DNN memory management system that breaks apart memory bottlenecks while maintaining the efficiency of DNN training, using a model-guided approach that holistically exploits tensor splitting and its joint optimization with out-of-core execution methods (offload and recompute).

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Alpa automates model-parallel training of large deep learning models by generating execution plans that unify data, operator, and pipeline parallelism, and it generalizes to models with heterogeneous architectures and models without manually designed plans.

Automap: Towards Ergonomic Automated Parallelism for ML Models

This work presents the prototype of an automated partitioner that seamlessly integrates into existing compilers and existing user workflows and enables SPMD-style parallelism that encompasses data parallelism and parameter/activation sharding.

Sequence Parallelism: Long Sequence Training from System Perspective

This work proposes sequence parallelism, a memory-efficient parallelism method that breaks the input sequence length limitation and trains with longer sequences on GPUs efficiently; it is compatible with most existing parallelisms, which makes 4D parallelism possible.

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

This paper proposes and develops a family of language models named GLaM, which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.



GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding, and it is demonstrated that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Supporting Very Large Models using Automatic Dataflow Graph Partitioning

Tofu is a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint and describes the semantics of an operator in a simple language inspired by Halide to automatically partition each operator.

DistIR: An Intermediate Representation for Optimizing Distributed Neural Networks

An analysis framework for DistIR programs is presented, including a simulator and reference executor that can be used to automatically search for an optimal distribution strategy; preliminary results using a grid search over a hybrid data/horizontal/pipeline-parallel space suggest that DistIR and its simulator can aid automatic DNN distribution.

Mesh-TensorFlow: Deep Learning for Supercomputers

Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and is used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.

Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training

This paper presents an approach to automatically shard the weight update computation across replicas with efficient communication primitives and data formatting, using static analysis and transformations on the training computation graph, and achieves substantial speedups on typical image and language models on Cloud TPUs, requiring no change to model code.

Efficient Algorithms for Device Placement of DNN Graph Operators

This paper identifies and isolates the structured optimization problem at the core of device placement of DNN operators, for both inference and training, especially in modern pipelined settings, and provides algorithms that solve this problem to optimality.

Glow: Graph Lowering Compiler Techniques for Neural Networks

Glow features a lowering phase which enables the compiler to support a high number of input operators as well as a large number of hardware targets by eliminating the need to implement all operators on all targets.

DAPPLE: a pipelined data parallel approach for training large models

DAPPLE, a synchronous training framework that combines data parallelism and pipeline parallelism for large DNN models, is proposed; it features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategies of data and pipeline parallelism.

PipeDream: generalized pipeline parallelism for DNN training

PipeDream is presented, a system that adds inter-batch pipelining to intra-batch parallelism to further improve parallel training throughput, helping to better overlap computation with communication and reduce the amount of communication when possible.

Beyond Data and Model Parallelism for Deep Neural Networks

A more comprehensive search space of parallelization strategies for DNNs, called SOAP, is defined; it includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions. FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine, is also proposed.