# Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

@article{Jia2018ExploringHD, title={Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks}, author={Zhihao Jia and Sina Lin and C. Qi and Alexander Aiken}, journal={ArXiv}, year={2018}, volume={abs/1802.04924} }

The past few years have witnessed growth in the size and computational requirements for training deep convolutional neural networks. Current approaches parallelize the training process onto multiple devices by applying a single parallelization strategy (e.g., data or model parallelism) to all layers in a network. Although easy to reason about, this design results in suboptimal runtime performance in large-scale distributed training, since different layers in a network may prefer different…
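
The abstract's core argument — that per-layer strategy choice beats one global strategy — can be sketched with a toy cost model. Everything below is illustrative (the cost formulas, layer numbers, and bandwidth are assumptions, not the paper's actual model): data parallelism pays to all-reduce the layer's weights, model parallelism pays to exchange its activations, so a convolution (small weights, large activations) and a fully connected layer (large weights, small activations) prefer opposite strategies.

```python
# Hypothetical per-layer cost model: compare the step time of data
# parallelism (replicate weights, all-reduce gradients) against model
# parallelism (partition weights, exchange activations) and pick the
# cheaper one per layer. Units: compute in seconds, sizes in bytes,
# bandwidth in bytes/second. All numbers are made up for illustration.

def step_time_data_parallel(compute, weight_bytes, devices, bw):
    # Compute is split across devices; gradients are ring-all-reduced.
    return compute / devices + 2 * weight_bytes * (devices - 1) / (devices * bw)

def step_time_model_parallel(compute, activation_bytes, devices, bw):
    # Weights are partitioned; activations cross device boundaries each step.
    return compute / devices + activation_bytes / bw

def choose_strategy(layer, devices=4, bw=1e9):
    dp = step_time_data_parallel(layer["compute"], layer["weights"], devices, bw)
    mp = step_time_model_parallel(layer["compute"], layer["acts"], devices, bw)
    return "data" if dp <= mp else "model"

layers = [
    {"name": "conv1", "compute": 0.008, "weights": 1e5, "acts": 5e7},  # small weights
    {"name": "fc1",   "compute": 0.002, "weights": 4e8, "acts": 1e5},  # huge weights
]
plan = {layer["name"]: choose_strategy(layer) for layer in layers}
# The convolution prefers data parallelism; the fully connected layer
# prefers model parallelism.
```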

#### 53 Citations

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

- Computer Science, Mathematics
- IEEE Micro
- 2019

This work explores hybrid parallelization, in which each data-parallel worker comprises more than one device so that model parallelism accelerates each training step, and shows that at scale, hybrid training is more effective at minimizing end-to-end training time than data parallelism (DP) alone.
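
The trade-off that makes hybrid training win at scale can be shown with a toy step-time model (the formulas and numbers are illustrative assumptions, not the paper's measurements): N devices are split into g data-parallel groups, each group exploiting model parallelism internally. Fewer groups mean more activation exchange; more groups mean more gradient traffic; an intermediate g can minimize the sum.

```python
# Toy step-time model for hybrid parallelism on 8 devices. Intra-group:
# model parallelism splits compute and exchanges activations. Inter-group:
# a naive (non-ring) gradient reduction whose cost grows with the number
# of data-parallel groups. All constants are made up for illustration.

def hybrid_step_time(n_devices, groups, compute, weight_bytes, act_bytes, bw):
    per_group = n_devices // groups
    # Total compute is split across all devices; activations cross the
    # (per_group - 1) boundaries inside each model-parallel group.
    intra = compute / (groups * per_group) + act_bytes * (per_group - 1) / bw
    # Naive gradient exchange among data-parallel groups.
    inter = 2 * weight_bytes * (groups - 1) / bw if groups > 1 else 0.0
    return intra + inter

best = min(
    (1, 2, 4, 8),
    key=lambda g: hybrid_step_time(8, g, compute=0.8, weight_bytes=4e7,
                                   act_bytes=4e7, bw=1e9),
)
# With these numbers the optimum is a hybrid configuration (g = 2), beating
# both pure model parallelism (g = 1) and pure data parallelism (g = 8).
```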

Beyond Data and Model Parallelism for Deep Neural Networks

- Computer Science
- MLSys
- 2019

This work defines SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes parallelizing in the Sample, Operation, Attribute, and Parameter dimensions, and proposes FlexFlow, a deep learning framework that uses a guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine.
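
A randomized search over a per-layer strategy space can be sketched as follows. This is a minimal hill-climbing loop in the spirit of the approach, not FlexFlow's actual algorithm or API; the cost function and degree choices are stand-in assumptions.

```python
import random

# Toy randomized search over a SOAP-like space: each layer picks
# parallelism degrees for two dimensions (here called sample and
# parameter); a simulated cost scores a whole strategy, and we hill-climb
# by re-sampling one layer at a time.

DEGREES = (1, 2, 4)

def simulated_step_time(strategy):
    # Stand-in cost: prefers balanced, high-degree choices per layer.
    return sum(abs(s - p) + 1.0 / (s * p) for s, p in strategy)

def search(num_layers, iters=500, seed=0):
    rng = random.Random(seed)
    best = [(rng.choice(DEGREES), rng.choice(DEGREES)) for _ in range(num_layers)]
    best_t = simulated_step_time(best)
    for _ in range(iters):
        cand = list(best)
        cand[rng.randrange(num_layers)] = (rng.choice(DEGREES), rng.choice(DEGREES))
        t = simulated_step_time(cand)
        if t < best_t:
            best, best_t = cand, t
    return best, best_t
```

The real system replaces `simulated_step_time` with an execution simulator of the target machine, which is what makes the search "guided".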

An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

- Computer Science
- HPDC
- 2021

This work analyzes the compute, communication, and memory requirements of convolutional neural networks (CNNs) to understand the trade-offs between different parallelism approaches in performance and scalability; the resulting oracle has an average accuracy of about 86.74% compared to empirical results, and as high as 97.57% for data parallelism.

Accelerating Distributed SGD With Group Hybrid Parallelism

- Computer Science
- IEEE Access
- 2021

This work proposes an efficient parallelism strategy named group hybrid parallelism (GHP) to minimize training time without any accuracy loss, and evaluates the heuristics that determine the parallelization strategy minimizing training time.

TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism

- Computer Science, Mathematics
- ArXiv
- 2020

FT, an efficient algorithm that searches for an optimal set of parallelization strategies to allow trade-offs among different objectives, is proposed, and a user-friendly system called TensorOpt is developed that lets users run distributed DNN training jobs without worrying about the details of parallelization strategies.

Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform

- Computer Science
- ArXiv
- 2018

This paper presents a pipelined model-parallel execution method that enables high GPU utilization while maintaining robust training accuracy via a novel weight prediction technique, SpecTrain, achieving up to 8.91x speedup over data parallelism on a 4-GPU platform while maintaining comparable model accuracy.
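
The weight-prediction idea can be sketched as follows. This is a reconstruction of the concept, not SpecTrain's exact code: in a pipeline, a stage computes with weights that will be roughly `staleness` update steps old by the time its gradients apply, so one plausible prediction extrapolates future weights along the momentum-smoothed gradient `v`.

```python
# Momentum-based weight prediction (illustrative form): if SGD with
# momentum applies roughly -lr * v per step, the weights `staleness`
# steps in the future are approximately w - staleness * lr * v.

def predict_weights(w, v, lr, staleness):
    return [wi - staleness * lr * vi for wi, vi in zip(w, v)]
```

A pipeline stage would run its forward and backward pass against the predicted weights instead of the stale ones, shrinking the effective staleness gap.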

Machine Learning Parallelism Could Be Adaptive, Composable and Automated (proposal)

- 2019

In recent years, the pace of innovation in machine learning has accelerated. To cope with the sheer computational complexity of training large ML models on large datasets, researchers…

Partitioning sparse deep neural networks for scalable training and inference

- Computer Science
- ICS
- 2021

A distributed-memory parallel SpMV-based solution for the SGD algorithm is proposed to improve its scalability, together with a novel hypergraph model for partitioning weight matrices that reduces total communication volume and ensures computational load balance among processors.

Fast Training of Deep Learning Models over Multiple GPUs

- Computer Science
- Middleware
- 2020

This paper proposes FastT, a transparent module that works with the TensorFlow framework to automatically identify a satisfying deployment and execution order of operations in DNN models over…

Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism

- Computer Science
- ArXiv
- 2020

This paper proposes Dynamic Communication Thresholding (DCT), a compression framework for communication-efficient hybrid training that reduces overall communication by 20x and improves end-to-end training time on industry-scale models by 37%.
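
One common form of communication thresholding, sketched here as an illustration of the general idea rather than DCT's specific algorithm, is to transmit only the entries whose magnitude clears a top-k threshold and carry the rest forward locally as error feedback.

```python
# Threshold-based gradient compression with error feedback (illustrative):
# add the local residual back in, keep only the largest-magnitude fraction,
# and stash the suppressed entries as the new residual so no gradient mass
# is permanently lost.

def threshold_compress(grad, keep_fraction, residual):
    full = [g + r for g, r in zip(grad, residual)]
    k = max(1, int(len(full) * keep_fraction))
    thresh = sorted((abs(g) for g in full), reverse=True)[k - 1]
    sent, new_residual = [], []
    for g in full:
        if abs(g) >= thresh:
            sent.append(g)
            new_residual.append(0.0)
        else:
            sent.append(0.0)
            new_residual.append(g)
    return sent, new_residual
```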

#### References

Showing 1-10 of 29 references

Learning the Number of Neurons in Deep Networks

- Computer Science
- NIPS
- 2016

This paper proposes to use a group sparsity regularizer on the parameters of the network, where each group is defined to act on a single neuron, and shows that this approach can reduce the number of parameters by up to 80% while retaining or even improving network accuracy.
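
The regularizer described here is a group-lasso penalty: with one group per neuron, the loss gains a term proportional to the sum of the per-group L2 norms, so the optimizer can drive an entire neuron's weights to exactly zero. A minimal sketch (the grouping and weight values are illustrative):

```python
import math

# Group-lasso penalty with one group per neuron's incoming weights:
#   R(W) = lam * sum over groups g of ||w_g||_2
# Zeroing a whole group removes that neuron from the network.

def group_sparsity_penalty(groups, lam):
    return lam * sum(math.sqrt(sum(w * w for w in g)) for g in groups)
```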

Learning both Weights and Connections for Efficient Neural Network

- Computer Science
- NIPS
- 2015

A method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections and pruning redundant ones with a three-step method: train, prune, retrain.
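
The middle step of that pipeline is magnitude-based pruning: connections whose learned weight magnitude falls below a threshold are removed, and the surviving sparse network is retrained. A minimal sketch of the pruning step (threshold and values are illustrative):

```python
# Magnitude pruning: zero out weights below the threshold and return both
# the pruned weights and the binary mask (used to keep pruned connections
# at zero during retraining).

def prune_by_magnitude(weights, threshold):
    mask = [1 if abs(w) > threshold else 0 for w in weights]
    pruned = [w * m for w, m in zip(weights, mask)]
    return pruned, mask
```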

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

- Computer Science, Mathematics
- USENIX Annual Technical Conference
- 2017

Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication; it is applicable to different DL frameworks, as demonstrated by plugging it into Caffe and TensorFlow.

Large Scale Distributed Deep Networks

- Computer Science
- NIPS
- 2012

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
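
Downpour SGD's core mechanism is an asynchronous parameter server: model replicas fetch parameters, compute gradients on their data shard, and push updates without synchronizing with each other. A single-process sketch of one parameter shard (no real networking; the class and update rule are illustrative):

```python
# One shard of an asynchronous parameter server. Replicas call fetch() to
# get a (possibly already stale) copy of the parameters and push() to apply
# their gradients; pushes from different replicas interleave freely.

class ParameterShard:
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def fetch(self):
        return list(self.params)

    def push(self, grads):
        # Applied whenever a replica's update arrives; no locking across
        # replicas, which is the source of Downpour's (tolerated) staleness.
        for i, g in enumerate(grads):
            self.params[i] -= self.lr * g
```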

Going deeper with convolutions

- Computer Science
- 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition…

SplitNet: Learning to Semantically Split Deep Networks for Parameter Reduction and Model Parallelization

- Computer Science
- ICML
- 2017

A novel deep neural network is proposed that is both lightweight and effectively structured for model parallelization; it yields networks with a significantly reduced number of parameters while achieving comparable or superior accuracy to the original full networks, and accelerates test speed on multiple GPUs.

Device Placement Optimization with Reinforcement Learning

- Computer Science
- ICML
- 2017

A method is presented that learns to optimize device placement for TensorFlow computational graphs using a sequence-to-sequence model, finding non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

- Computer Science
- ArXiv
- 2017

This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling training of visual recognition models on internet-scale data with high efficiency.
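
The paper's practical recipe is the linear scaling rule with gradual warmup: scale the learning rate proportionally to the minibatch size and ramp up to that scaled rate over the first few epochs. A sketch (the base values match the paper's ImageNet setup; the exact ramp shape here is a simplification):

```python
# Linear scaling rule: lr = base_lr * batch_size / base_batch, reached via
# a linear warmup ramp from base_lr over the first warmup_epochs epochs to
# avoid instability early in training.

def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target
```

For the paper's headline setting (8192 images per minibatch), this yields a post-warmup learning rate of 0.1 × 8192/256 = 3.2.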

Very Deep Convolutional Networks for Large-Scale Image Recognition

- Computer Science
- ICLR
- 2015

This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

ImageNet classification with deep convolutional neural networks

- Computer Science
- Commun. ACM
- 2012

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into 1000 different classes, employing a recently developed regularization method called "dropout" that proved to be very effective.
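
Dropout itself is simple to state: during training each unit is zeroed with probability p. The sketch below uses the now-standard "inverted" variant, which scales the survivors by 1/(1-p) so no rescaling is needed at test time (the original paper instead scaled activations at test time).

```python
import random

# Inverted dropout over a list of activations: each value is dropped with
# probability p; survivors are scaled by 1/(1-p) to preserve the expected
# activation magnitude.

def dropout(xs, p, rng):
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in xs]
```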