Corpus ID: 53670168

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

@article{Huang2019GPipeET,
  title={GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism},
  author={Yanping Huang and Youlong Cheng and Dehao Chen and HyoukJoong Lee and Jiquan Ngiam and Quoc V. Le and Zhifeng Chen},
  journal={ArXiv},
  year={2019},
  volume={abs/1811.06965}
}
Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. [...] Key Method: By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. [...]
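The batch-splitting idea in the abstract can be made concrete with a small, framework-free sketch. Everything below (the per-device `stages` callables and the fill-and-drain tick loop) is an illustrative assumption, not GPipe's actual implementation; it only shows how splitting a mini-batch into micro-batches lets different pipeline stages work on different micro-batches at the same time.

```python
# Illustrative sketch of micro-batch pipelining (not GPipe's actual code).
# `stages` is a list of per-device forward functions; a mini-batch is split
# into micro-batches that move one stage per "clock tick".

def pipeline_forward(stages, micro_batches):
    num_stages, num_micro = len(stages), len(micro_batches)
    in_flight = {}                       # stage index -> (micro-batch index, activation)
    outputs = [None] * num_micro

    for tick in range(num_stages + num_micro - 1):
        # Inject the next micro-batch into the first stage, if any remain.
        if tick < num_micro:
            in_flight[0] = (tick, micro_batches[tick])
        # Advance every occupied stage by one hop, last stage first, so each
        # micro-batch moves exactly one stage per tick.
        for s in sorted(in_flight, reverse=True):
            m, activation = in_flight.pop(s)
            activation = stages[s](activation)
            if s + 1 < num_stages:
                in_flight[s + 1] = (m, activation)
            else:
                outputs[m] = activation
    return outputs
```

With S stages and M micro-batches, the schedule spans S + M - 1 ticks and each device is busy for M of them, so utilization approaches 1 as M grows relative to S; that is the intuition behind the "almost linear speedup" claimed in the abstract.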
Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA
TLDR: A performance model based on a concurrency analysis of out-of-core training behavior is proposed, and a strategy that combines layer swapping and redundant recomputing is derived; it can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training
TLDR: A new pipeline-parallelism training framework, BaPipe, which can automatically explore pipeline-parallel training methods and balanced partition strategies for distributed DNN training, providing up to 3.2x speedup and 4x memory reduction on various platforms.
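BaPipe's actual partitioning search is not reproduced in the entry above, so the sketch below shows only a generic balanced partitioning of per-layer costs into contiguous pipeline stages (binary search on the bottleneck stage cost), as one way such a "balanced partition strategy" can be computed; the function names and the example costs are illustrative.

```python
# Generic balanced pipeline partitioning sketch (illustrative, not BaPipe's algorithm).
# Split per-layer costs into `num_stages` contiguous groups so the most expensive
# group (the pipeline bottleneck) is as cheap as possible.

def balanced_partition(layer_costs, num_stages):
    def stages_needed(limit):
        # Greedily pack layers into stages without exceeding `limit` per stage.
        stages, current = 1, 0.0
        for c in layer_costs:
            if current + c > limit:
                stages, current = stages + 1, c
            else:
                current += c
        return stages

    lo, hi = max(layer_costs), sum(layer_costs)
    while hi - lo > 1e-6:                 # binary search on the bottleneck cost
        mid = (lo + hi) / 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid

    # Rebuild the actual stage boundaries for the found bottleneck cost.
    boundaries, start, current = [], 0, 0.0
    for i, c in enumerate(layer_costs):
        if current + c > hi:
            boundaries.append((start, i))
            start, current = i, c
        else:
            current += c
    boundaries.append((start, len(layer_costs)))
    return boundaries

# Example: 8 layers with uneven costs split across 4 accelerators.
print(balanced_partition([1, 2, 3, 1, 2, 2, 4, 1], 4))   # [(0, 2), (2, 4), (4, 6), (6, 8)]
```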
Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
TLDR: FTPipe is a system that explores a previously unexplored dimension of pipeline model parallelism, making multi-GPU execution of fine-tuning tasks for giant neural networks readily accessible; it achieves up to 3× speedup and state-of-the-art accuracy when fine-tuning giant transformers with billions of parameters.
Heterogeneous model parallelism for deep neural networks
TLDR: This work proposes a novel model-parallelism technique for heterogeneous platforms, implementing a load-balancing mechanism between uneven devices of an HPC platform and building on Google Brain's Mesh-TensorFlow for convolutional networks.
Automatic Graph Partitioning for Very Large-scale Deep Learning
TLDR: This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism and compares it with two popular frameworks, Megatron-LM (hybrid parallelism) and GPipe (originally proposed for model parallelism), for training models with increasingly large numbers of parameters.
Maximizing Parallelism in Distributed Training for Huge Neural Networks
TLDR: This work is the first to introduce 3-dimensional model parallelism for expediting huge language models; by reaching a perfect load balance, it incurs smaller memory and communication costs than existing state-of-the-art 1-D and 2-D model parallelism.
Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism
TLDR: This paper proposes a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training, which reduces overall communication by 20x and improves end-to-end training time on industry-scale models by 37%.
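The TLDR names the technique (thresholding what gets communicated) but not its exact rule. The sketch below is one plausible reading, not DCT's algorithm: gradient entries below a magnitude threshold are withheld from communication and accumulated locally (error feedback, a common companion trick that is assumed here rather than taken from the paper).

```python
import numpy as np

# Illustrative threshold-based gradient compression (an assumption about what
# "communication thresholding" could look like, not DCT's exact algorithm).

class ThresholdCompressor:
    def __init__(self, threshold):
        self.threshold = threshold
        self.residual = None            # locally accumulated, unsent gradient mass

    def compress(self, grad):
        if self.residual is None:
            self.residual = np.zeros_like(grad)
        grad = grad + self.residual
        mask = np.abs(grad) >= self.threshold
        self.residual = np.where(mask, 0.0, grad)      # keep what we did not send
        return np.flatnonzero(mask), grad[mask]        # sparse payload to communicate

    @staticmethod
    def decompress(indices, values, shape):
        dense = np.zeros(int(np.prod(shape)))
        dense[indices] = values
        return dense.reshape(shape)
```

Usage would be symmetric: each worker sends `(indices, values)` instead of the dense gradient, and the receiver calls `decompress` before aggregation.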
PipeMare: Asynchronous Pipeline Parallel DNN Training
TLDR: This paper derives a simple but robust training method, called PipeMare, that tolerates asynchronous updates during pipeline-parallel execution; it is the first to explore these techniques together with fine-grained pipeline parallelism during neural network training.
Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
TLDR: This work explores hybrid parallelization, where each data-parallel worker comprises more than one device so that each training step is accelerated by model parallelism, and shows that at scale, hybrid training is more effective at minimizing end-to-end training time than exploiting data parallelism (DP) alone.
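As a concrete picture of "each data-parallel worker comprises more than one device", the sketch below is purely illustrative (the grouping scheme and function name are assumptions): it splits a flat list of device ids into model-parallel groups and the matching data-parallel groups over which gradients of the same model shard would be all-reduced.

```python
# Illustrative device grouping for hybrid data/model parallelism (not from the paper).

def hybrid_groups(num_devices, model_parallel_size):
    """Split device ids into model-parallel groups and data-parallel groups."""
    assert num_devices % model_parallel_size == 0
    data_parallel_size = num_devices // model_parallel_size
    # Each row is one data-parallel replica spanning `model_parallel_size` devices.
    mp_groups = [list(range(r * model_parallel_size, (r + 1) * model_parallel_size))
                 for r in range(data_parallel_size)]
    # Each column groups the devices holding the same model shard; gradients for
    # that shard are all-reduced within the column.
    dp_groups = [[row[s] for row in mp_groups] for s in range(model_parallel_size)]
    return mp_groups, dp_groups

# Example: 8 devices, 2-way model parallelism -> 4 data-parallel replicas.
mp, dp = hybrid_groups(8, 2)
print(mp)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(dp)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```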

References

Showing 1-10 of 83 references
Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform
TLDR: This paper presents a pipelined model-parallel execution method that enables high GPU utilization while maintaining robust training accuracy via a novel weight-prediction technique, SpecTrain, and achieves up to 8.91x speedup compared to data parallelism on a 4-GPU platform while maintaining comparable model accuracy.
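The entry names weight prediction but not its formula. Assuming the common momentum-based form (extrapolate the SGD-with-momentum update a few steps ahead to counter staleness), a minimal sketch looks like this; it is not necessarily SpecTrain's exact rule.

```python
# Illustrative weight prediction for pipelined training (assumed form, not
# necessarily SpecTrain's exact rule): a stage whose gradient will only arrive
# `staleness` steps later runs against predicted future weights instead of the
# current, soon-to-be-stale ones.

def predict_weights(weights, velocity, lr, staleness):
    """Extrapolate SGD-with-momentum `staleness` steps ahead: w_hat = w - s * lr * v."""
    return [w - staleness * lr * v for w, v in zip(weights, velocity)]
```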
Training Deep Nets with Sublinear Memory Cost
TLDR: This work designs an algorithm that costs O(√n) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch, showing that it is possible to trade computation for memory and obtain a more memory-efficient training algorithm at a small extra computation cost.
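A minimal sketch of the recomputation idea the entry describes: store roughly √n segment-boundary activations during the forward pass and recompute everything inside a segment from its checkpoint during the backward pass. The segment size and helper names below are illustrative, not the paper's exact algorithm.

```python
import math

# Illustrative O(sqrt(n)) checkpointing sketch: only segment-boundary activations
# are kept; the rest are recomputed, costing roughly one extra forward pass.

def checkpointed_forward(layers, x):
    segment = max(1, math.isqrt(len(layers)))
    checkpoints = []                      # one stored activation per segment
    for i, layer in enumerate(layers):
        if i % segment == 0:
            checkpoints.append((i, x))    # remember the segment input
        x = layer(x)
    return x, checkpoints

def recompute_segment(layers, checkpoints, k):
    """Recompute the per-layer inputs of segment k from its stored boundary input."""
    start, x = checkpoints[k]
    end = checkpoints[k + 1][0] if k + 1 < len(checkpoints) else len(layers)
    activations = []
    for layer in layers[start:end]:
        activations.append(x)
        x = layer(x)
    return activations                    # now available for this segment's backward pass
```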
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
TLDR: Experiments with five different DNNs on two different clusters show that PipeDream is up to 5x faster in time-to-accuracy compared to data-parallel training.
Mesh-TensorFlow: Deep Learning for Supercomputers
TLDR: Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations; it is used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
TLDR: The API design and the system implementation of MXNet are described, and it is explained how embedding of both symbolic expression and tensor operation is handled in a unified fashion.
Mixed Precision Training
TLDR: This work introduces a technique to train deep neural networks using half-precision floating-point numbers and demonstrates that this approach works for a wide variety of models, including convolutional neural networks, recurrent neural networks, and generative adversarial networks.
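A minimal sketch of the usual reading of this recipe (an FP32 "master" copy of the weights, FP16 compute, and loss scaling so small gradients survive FP16's limited range); the `forward_backward` callable is a placeholder the reader supplies, and the constant loss scale is an illustrative choice.

```python
import numpy as np

# Minimal mixed-precision sketch: FP16 compute against an FP32 master copy,
# with loss scaling applied before backprop and removed before the update.
# `forward_backward(w_fp16, scale)` must return the gradient of (loss * scale).

def mixed_precision_step(master_weights_fp32, forward_backward, lr, loss_scale=1024.0):
    weights_fp16 = master_weights_fp32.astype(np.float16)        # low-precision copy
    grad_fp16 = forward_backward(weights_fp16, loss_scale)       # FP16 compute
    grad_fp32 = grad_fp16.astype(np.float32) / loss_scale        # unscale in FP32
    if not np.all(np.isfinite(grad_fp32)):                       # skip step on overflow
        return master_weights_fp32
    return master_weights_fp32 - lr * grad_fp32                  # FP32 master update
```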
Large Scale Distributed Deep Networks
TLDR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
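The entry names Downpour SGD, whose core loop has each model replica fetch parameters from a parameter server, compute a gradient on its own data shard, and push the gradient back without a global barrier. A single-process sketch of that loop (sharding, network transport, and server-side adaptive learning rates are omitted; the class and function names are illustrative) might look like this.

```python
import numpy as np

# Single-process sketch in the spirit of Downpour SGD's asynchronous
# parameter-server loop (not the paper's implementation).

class ParameterServer:
    def __init__(self, params, lr):
        self.params, self.lr = params, lr

    def pull(self):
        return self.params.copy()          # replicas fetch possibly-stale weights

    def push(self, grad):
        self.params -= self.lr * grad      # updates applied as gradients arrive

def replica_step(server, data_batch, compute_grad):
    local_params = server.pull()           # may already be stale when used
    grad = compute_grad(local_params, data_batch)
    server.push(grad)                      # asynchronous, no global barrier
```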
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
TLDR: This work builds a highly scalable deep learning training system for dense GPU clusters with three main contributions: a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy, an optimization approach for extremely large mini-batch sizes that can train CNN models on the ImageNet dataset without loss of accuracy, and highly optimized all-reduce algorithms.
On Model Parallelization and Scheduling Strategies for Distributed Machine Learning
TLDR: A system for model parallelism, STRADS, that provides a programming abstraction for scheduling parameter updates by discovering and leveraging changing structural properties of ML programs, which enables a flexible tradeoff between scheduling efficiency and fidelity to intrinsic dependencies within the models, and improves the memory efficiency of distributed ML.
Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark
TLDR: DAWNBench entries are analyzed to show that time-to-accuracy (TTA) has a low coefficient of variation and that models optimized for TTA generalize nearly as well as those trained using standard methods; it is also found that distributed entries can spend more than half of their time on communication.