Corpus ID: 3619071

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

@article{Jia2018ExploringHD,
  title={Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks},
  author={Zhihao Jia and Sina Lin and C. Qi and Alexander Aiken},
  journal={ArXiv},
  year={2018},
  volume={abs/1802.04924}
}
The past few years have witnessed growth in the size and computational requirements for training deep convolutional neural networks. Current approaches parallelize the training process across multiple devices by applying a single parallelization strategy (e.g., data or model parallelism) to all layers in a network. Although easy to reason about, this design results in suboptimal runtime performance in large-scale distributed training, since different layers in a network may prefer different parallelization strategies.
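The core idea, letting each layer pick its own strategy by solving a search problem over per-layer costs, can be illustrated with a small dynamic program over a chain of layers. This is a minimal sketch, not the paper's actual algorithm or cost model: the two-strategy space, all timing numbers, and the constant transition cost are hypothetical placeholders.

```python
STRATEGIES = ["data", "model"]

# exec_cost[layer][strategy]: per-step execution time for that layer under
# that strategy (hypothetical numbers, in milliseconds).
exec_cost = [
    {"data": 1.0, "model": 3.0},   # conv layer: large activations, few params
    {"data": 1.2, "model": 2.5},   # conv layer
    {"data": 4.0, "model": 1.5},   # fully connected layer: many params
]

def transition_cost(prev, curr):
    """Cost of re-laying-out tensors when adjacent layers use different
    strategies (hypothetical constant)."""
    return 0.0 if prev == curr else 0.8

def best_layerwise_plan(exec_cost):
    """Dynamic program over the layer chain: O(L * S^2)."""
    # dp[s] = (total cost, plan) of the best assignment ending in strategy s.
    dp = {s: (exec_cost[0][s], [s]) for s in STRATEGIES}
    for layer in exec_cost[1:]:
        new_dp = {}
        for s in STRATEGIES:
            cost, plan = min(
                (dp[p][0] + transition_cost(p, s) + layer[s], dp[p][1])
                for p in STRATEGIES
            )
            new_dp[s] = (cost, plan + [s])
        dp = new_dp
    return min(dp.values())

cost, plan = best_layerwise_plan(exec_cost)
print(plan, cost)   # ['data', 'data', 'model'] 4.5
```

On these placeholder numbers the program assigns data parallelism to the convolutions and model parallelism to the fully connected layer, which is the qualitative behavior that motivates layer-wise parallelism.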
Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training
TLDR: Explores hybrid parallelization, in which each data-parallel worker comprises more than one device so that model parallelism also accelerates each training step, and shows that at scale, hybrid training minimizes end-to-end training time more effectively than data parallelism alone.
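As an illustration of the hybrid pattern this work studies (model parallelism inside each data-parallel worker), here is a toy NumPy simulation of one training step for a single fully connected layer. The 2x2 device grid, the column-split layout, and the sum-of-outputs "loss" are assumptions; "devices" are just array slices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 8, 6, 4
W = rng.normal(size=(d_in, d_out))

def forward_model_parallel(x, W, num_devices=2):
    # Each device holds a column shard of W and computes its output slice.
    shards = np.array_split(W, num_devices, axis=1)
    outs = [x @ shard for shard in shards]      # would run on separate devices
    return np.concatenate(outs, axis=1)         # all-gather of the outputs

# Two data-parallel workers, each seeing a different half of the batch.
x = rng.normal(size=(batch, d_in))
x0, x1 = x[: batch // 2], x[batch // 2:]

y0 = forward_model_parallel(x0, W)
y1 = forward_model_parallel(x1, W)

# Data-parallel step: gradients from both workers are averaged (a toy
# gradient of a sum-of-outputs "loss" w.r.t. W).
g0 = x0.T @ np.ones_like(y0)
g1 = x1.T @ np.ones_like(y1)
g = (g0 + g1) / 2                               # all-reduce across workers
W -= 0.01 * g
```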
Beyond Data and Model Parallelism for Deep Neural Networks
TLDR: Defines SOAP, a more comprehensive search space of parallelization strategies for DNNs that parallelize in the Sample, Operation, Attribute, and Parameter dimensions, and proposes FlexFlow, a deep learning framework that uses a guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine.
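The guided randomized search the summary mentions can be sketched as a Metropolis-style local search over per-operation configurations. This greatly simplifies FlexFlow, which scores candidates with an execution simulator; the cost function, op list, and SOAP-like configuration names here are toys.

```python
import math
import random

OPS = ["conv1", "conv2", "fc1", "fc2"]
CONFIGS = ["sample-4", "sample-2+param-2", "param-4"]

def cost(strategy):
    # Hypothetical stand-in for the simulator: parameter-dimension splits
    # are assumed to pay off only for the parameter-heavy fc ops.
    base = {"sample-4": 1.0, "sample-2+param-2": 1.2, "param-4": 1.5}
    fc_bonus = sum(1 for op in OPS
                   if op.startswith("fc") and strategy[op] == "param-4")
    return sum(base[strategy[op]] for op in OPS) - 0.6 * fc_bonus

def search(iters=2000, temperature=0.1):
    current = {op: random.choice(CONFIGS) for op in OPS}
    best = dict(current)
    for _ in range(iters):
        proposal = dict(current)
        proposal[random.choice(OPS)] = random.choice(CONFIGS)  # local move
        delta = cost(proposal) - cost(current)
        # Accept improvements always, worse moves with Metropolis probability.
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = proposal
            if cost(current) < cost(best):
                best = dict(current)
    return best, cost(best)

print(search())  # expect param-4 on the fc ops, sample-4 on the convs
```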
An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
TLDR: Analyzes the compute, communication, and memory requirements of convolutional neural networks (CNNs) to understand the trade-offs among parallelism approaches in performance and scalability; the resulting oracle matches empirical results with an average accuracy of about 86.74%, and as high as 97.57% for data parallelism.
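In the same spirit as such an oracle, a first-order analytic model already separates compute-bound from communication-bound layers under data parallelism. The formulas below are standard back-of-the-envelope estimates; the hardware constants are hypothetical.

```python
def conv_costs(batch, h, w, c_in, c_out, k=3,
               flops_per_sec=10e12, link_bytes_per_sec=10e9, devices=4):
    flops = 2 * batch * h * w * c_in * c_out * k * k     # fwd MACs x 2
    params = c_in * c_out * k * k
    # Ring all-reduce moves ~2*(n-1)/n of the gradient bytes per device.
    comm_bytes = 4 * params * 2 * (devices - 1) / devices
    compute_s = 3 * flops / flops_per_sec                # fwd + bwd ~ 3x fwd
    comm_s = comm_bytes / link_bytes_per_sec
    return compute_s, comm_s

# Early conv layer: compute-bound under data parallelism.
print("early conv (compute_s, comm_s):", conv_costs(64, 56, 56, 64, 64))
# FC-like layer (1x1 spatial, huge channel counts): communication-bound.
print("fc-like    (compute_s, comm_s):", conv_costs(64, 1, 1, 4096, 4096, k=1))
```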
Accelerating Distributed SGD With Group Hybrid Parallelism
TLDR: Proposes an efficient parallelism strategy named group hybrid parallelism (GHP) that minimizes training time without any accuracy loss, and evaluates heuristics for choosing the parallelization strategy that minimizes training time.
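A toy version of the underlying heuristic question (pure data parallelism versus a grouped scheme) is to compare estimated gradient traffic over fast intra-group and slow inter-group links. This is not GHP's actual rule: the ring-all-reduce volume formula is standard, while the model size, device counts, and two-level network are assumptions.

```python
def dp_traffic_bytes(param_count, devices):
    # One flat ring all-reduce over all devices (4 bytes per fp32 param).
    return 4 * param_count * 2 * (devices - 1) / devices

def grouped_traffic_bytes(param_count, groups, devices_per_group):
    # Reduce within each fast group first, then all-reduce across group
    # leaders over the slower inter-group links.
    intra = 4 * param_count * 2 * (devices_per_group - 1) / devices_per_group
    inter = 4 * param_count * 2 * (groups - 1) / groups
    return intra, inter

params = 25_000_000   # roughly ResNet-50-sized
print("flat DP:", dp_traffic_bytes(params, devices=8))
print("grouped (intra, inter):",
      grouped_traffic_bytes(params, groups=2, devices_per_group=4))
```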
TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism
TLDR: Proposes FT, an efficient algorithm that searches for an optimal set of parallelization strategies to trade off among different objectives, and develops a user-friendly system, TensorOpt, that lets users run distributed DNN training jobs without attending to the details of parallelization strategies.
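The trade-off framing suggests a simple supporting primitive: keeping only the Pareto-optimal strategies under, say, step time and memory. A minimal sketch, with entirely hypothetical candidate strategies and numbers:

```python
candidates = [
    ("pure-DP",  1.00, 32.0),   # (name, step time in s, memory in GB)
    ("hybrid-A", 0.80, 40.0),
    ("hybrid-B", 0.85, 36.0),
    ("pure-MP",  1.60, 12.0),
    ("pipeline", 0.90, 20.0),
]

def pareto_front(cands):
    """Keep candidates not dominated in both time and memory."""
    front = []
    for name, t, m in cands:
        dominated = any(t2 <= t and m2 <= m and (t2, m2) != (t, m)
                        for _, t2, m2 in cands)
        if not dominated:
            front.append((name, t, m))
    return sorted(front, key=lambda c: c[1])

print(pareto_front(candidates))   # pure-DP drops out (pipeline dominates it)
```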
Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform
TLDR: Presents a pipelined model-parallel execution method that achieves high GPU utilization while maintaining robust training accuracy via a novel weight-prediction technique, SpecTrain, reaching up to 8.91x speedup over data parallelism on a 4-GPU platform with comparable model accuracy.
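The weight-prediction idea can be sketched directly: a pipeline stage whose weights will be `staleness` steps old extrapolates them with the smoothed (momentum) gradient, w_pred = w - staleness * lr * v. The quadratic toy objective and constants below are assumptions, not SpecTrain's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)
v = np.zeros(4)                      # momentum buffer (smoothed gradient)
lr, mu = 0.1, 0.9

def grad(w):
    return 2 * w                     # gradient of ||w||^2

def predict(w, v, staleness):
    return w - staleness * lr * v

for step in range(10):
    # What a pipeline stage running 2 steps ahead would use instead of the
    # stale w it actually holds:
    w_future = predict(w, v, staleness=2)
    v = mu * v + grad(w)             # standard momentum update
    w = w - lr * v

print(np.round(w, 4), np.round(w_future, 4))
```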
Proposal: Machine Learning Parallelism Could Be Adaptive, Composable and Automated
In recent years, the pace of innovation in the field of machine learning has accelerated. To cope with the sheer computational complexity of training large ML models on large datasets, researchers…
Partitioning sparse deep neural networks for scalable training and inference
TLDR: Proposes a distributed-memory parallel SpMV-based formulation of the SGD algorithm to improve its scalability, along with a novel hypergraph model for partitioning weight matrices that reduces total communication volume and ensures computational load balance among processors.
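The SpMV framing makes the communication cost concrete: with rows of a sparse matrix partitioned across processors, each processor must receive the input-vector entries matching its nonzero columns, and that count is exactly what a partitioner (e.g., the hypergraph model) tries to minimize. A sketch with a random sparsity pattern (an assumption) and a simple block row partition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, density, procs = 12, 0.25, 3
A = (rng.random((n, n)) < density) * rng.normal(size=(n, n))
x = rng.normal(size=n)

parts = np.array_split(np.arange(n), procs)   # block row partition

total_recv = 0
y = np.zeros(n)
for p, rows in enumerate(parts):
    local = A[rows]
    needed_cols = np.unique(np.nonzero(local)[1])
    owned_cols = set(rows.tolist())
    # Entries of x owned by other processors must be communicated.
    remote = [c for c in needed_cols if c not in owned_cols]
    total_recv += len(remote)
    y[rows] = local @ x                        # local SpMV

print("communication volume (words):", total_recv)
print("matches dense result:", np.allclose(y, A @ x))
```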
Fast Training of Deep Learning Models over Multiple GPUs
This paper proposes FastT, a transparent module that works with the TensorFlow framework to automatically identify a satisfying deployment and execution order of operations in DNN models over multiple GPUs.
Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism
TLDR: Proposes a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training, which reduces overall communication by 20x and improves end-to-end training time on industry-scale models by 37%.
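The general mechanism behind threshold-style schemes, sending only large-magnitude gradient entries and folding the dropped remainder back in via error feedback, can be sketched in a few lines. This is not DCT's exact rule; the top-k fraction and the toy gradient are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
residual = np.zeros(1000)            # error-feedback buffer

def compress_step(grad, keep_frac=0.01):
    global residual
    g = grad + residual              # add back what was dropped last step
    k = max(1, int(keep_frac * g.size))
    thresh = np.partition(np.abs(g), -k)[-k]
    mask = np.abs(g) >= thresh
    sent = np.where(mask, g, 0.0)    # sparse message actually communicated
    residual = g - sent              # remember the rest for next step
    return sent

g = rng.normal(size=1000)
sent = compress_step(g)
print("nonzeros sent:", np.count_nonzero(sent), "of", g.size)
```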

References

Showing 1-10 of 29 references
Learning the Number of Neurons in Deep Networks
TLDR: Proposes a group sparsity regularizer on the parameters of the network, with each group defined to act on a single neuron, and shows that this approach can reduce the number of parameters by up to 80% while retaining or even improving network accuracy.
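The regularizer itself is easy to state: one l2-norm group per neuron, so the penalty sums per-row norms and its proximal operator shrinks whole rows to exactly zero, removing entire neurons. A minimal sketch with assumed layer sizes, scales, and regularization weight, and with the loss-gradient step omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64)) * rng.uniform(0, 0.2, size=(32, 1))
lr, lam = 0.1, 0.1

def group_lasso(W):
    # Sum over neurons (rows) of the l2 norm of that neuron's weights.
    return np.linalg.norm(W, axis=1).sum()

def prox_step(W):
    """Shrink each row's norm by lr*lam; rows that fall below the
    shrinkage amount are zeroed out (the neuron is removed)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))
    return W * scale

for _ in range(50):                  # (loss-gradient step omitted)
    W = prox_step(W)

print("penalty:", round(group_lasso(W), 3))
print("surviving neurons:", int((np.linalg.norm(W, axis=1) > 0).sum()))
```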
Learning both Weights and Connections for Efficient Neural Network
TLDR: Presents a method that reduces the storage and computation required by neural networks by an order of magnitude without affecting accuracy, learning only the important connections and pruning redundant ones with a three-step train-prune-retrain procedure.
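The three steps (train to learn which connections matter, prune by magnitude, retrain the survivors under a fixed mask) can be sketched on a toy quadratic objective. The 90% sparsity target and the objective are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
target = rng.normal(size=(16, 16))

def loss_grad(W):
    return W - target                    # gradient of 0.5*||W - target||^2

# Step 1: train briefly to learn which connections matter.
for _ in range(20):
    W -= 0.1 * loss_grad(W)

# Step 2: prune the 90% of connections with smallest magnitude.
thresh = np.quantile(np.abs(W), 0.90)
mask = np.abs(W) > thresh
W *= mask

# Step 3: retrain only the surviving connections.
for _ in range(20):
    W -= 0.1 * loss_grad(W) * mask

print("density after pruning:", mask.mean())
```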
Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
TLDR: Poseidon exploits the layered model structure in DL programs to overlap communication and computation, reducing bursty network communication, and is applicable to different DL frameworks, demonstrated by plugging it into Caffe and TensorFlow.
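The scheduling idea, starting a layer's gradient transfer as soon as that layer's backward pass finishes so it overlaps with the remaining backward passes, can be simulated with background threads and sleeps. All timings are hypothetical.

```python
import threading
import time

LAYERS = ["fc2", "fc1", "conv2", "conv1"]      # backward-pass order
BWD_MS, COMM_MS = 5, 12

def communicate(layer):
    time.sleep(COMM_MS / 1000)                 # stands in for an all-reduce

start = time.time()
threads = []
for layer in LAYERS:
    time.sleep(BWD_MS / 1000)                  # backward pass of this layer
    t = threading.Thread(target=communicate, args=(layer,))
    t.start()                                  # overlaps with the next layer
    threads.append(t)
for t in threads:
    t.join()

elapsed_ms = (time.time() - start) * 1000
print(f"overlapped: ~{elapsed_ms:.0f} ms; "
      f"fully serial would be ~{len(LAYERS) * (BWD_MS + COMM_MS)} ms")
```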
Large Scale Distributed Deep Networks
TLDR: Considers the problem of training a deep network with billions of parameters on tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
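The Downpour SGD pattern, many workers asynchronously fetching parameters, computing gradients on their own data shard, and pushing updates to a shared parameter server without synchronizing with one another, can be sketched with threads. The toy quadratic objective and worker count are assumptions.

```python
import threading
import numpy as np

params = np.zeros(8)                     # the shared "parameter server"
target = np.arange(8.0)
lock = threading.Lock()                  # guards the shared array only

def worker(steps=100, lr=0.01):
    for _ in range(steps):
        with lock:
            w = params.copy()            # fetch current parameters
        g = w - target                   # local gradient on a stale copy
        with lock:
            params[:] = params - lr * g  # asynchronous (stale) update

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.round(params, 2))               # converges near `target`
```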
Going deeper with convolutions
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
SplitNet: Learning to Semantically Split Deep Networks for Parameter Reduction and Model Parallelization
TLDR: Proposes SplitNet, a deep neural network that is both lightweight and effectively structured for model parallelization; it yields networks with a significantly reduced number of parameters that achieve comparable or superior accuracy to the original full networks, with accelerated test speed on multiple GPUs.
Device Placement Optimization with Reinforcement Learning
TLDR: Presents a method that learns to optimize device placement for TensorFlow computational graphs using a sequence-to-sequence model, finding non-trivial placements that outperform hand-crafted heuristics and traditional algorithmic methods.
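The outer loop of this approach can be sketched with a much simpler policy than the paper's sequence-to-sequence network: an independent softmax per op, updated with REINFORCE against a measured runtime (here a toy cost that penalizes load imbalance and cross-device edges in an op chain).

```python
import numpy as np

rng = np.random.default_rng(0)
OPS, DEVICES = 6, 2
logits = np.zeros((OPS, DEVICES))        # independent policy per op

def runtime(placement):
    # Toy cost: max device load plus cross-device edges in a chain of ops.
    load = np.bincount(placement, minlength=DEVICES)
    cuts = np.sum(placement[:-1] != placement[1:])
    return load.max() + 0.5 * cuts

baseline = 0.0
for step in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    placement = np.array([rng.choice(DEVICES, p=p) for p in probs])
    reward = -float(runtime(placement))
    baseline = 0.9 * baseline + 0.1 * reward       # moving-average baseline
    onehot = np.eye(DEVICES)[placement]
    logits += 0.1 * (reward - baseline) * (onehot - probs)  # REINFORCE

print(np.argmax(logits, axis=1))         # learned placement per op
```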
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
TLDR: Empirically shows that large minibatches cause optimization difficulties on the ImageNet dataset, but that when these are addressed the trained networks exhibit good generalization, enabling visual recognition models to be trained on internet-scale data with high efficiency.
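The paper's recipe pairs the linear scaling rule (learning rate proportional to minibatch size) with a gradual warmup and the usual ImageNet step decays; a direct transcription of that schedule:

```python
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256,
                  warmup_epochs=5, decay_epochs=(30, 60, 80)):
    """Linearly scaled LR with gradual warmup and step decay."""
    peak = base_lr * batch_size / base_batch      # linear scaling rule
    if epoch < warmup_epochs:                     # ramp from base_lr to peak
        return base_lr + (peak - base_lr) * epoch / warmup_epochs
    drops = sum(1 for e in decay_epochs if epoch >= e)
    return peak * (0.1 ** drops)

for e in (0, 2, 5, 29, 30, 80):
    print(e, round(learning_rate(e, batch_size=8192), 4))
```

With batch size 8192 the peak rate becomes 3.2, reached after 5 warmup epochs, which matches the schedule the paper uses for its 1-hour ImageNet training.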
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR: Investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting using an architecture with very small (3x3) convolution filters, showing that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
ImageNet classification with deep convolutional neural networks
TLDR: Trains a large, deep convolutional neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes, employing a recently developed regularization method called "dropout" that proved to be very effective.