Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines
@article{Zhang2015PoseidonAS,
  title={Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines},
  author={H. Zhang and Zhiting Hu and Jinliang Wei and Pengtao Xie and Gunhee Kim and Qirong Ho and Eric P. Xing},
  journal={ArXiv},
  year={2015},
  volume={abs/1512.06216}
}
Deep learning (DL) has achieved notable successes in many machine learning tasks. A number of frameworks have been developed to expedite the process of designing and training deep neural networks (DNNs), such as Caffe, Torch and Theano. Currently they can harness multiple GPUs on a single machine, but are unable to use GPUs that are distributed across multiple machines; as even average-sized DNNs can take days to train on a single GPU with 100s of GBs to TBs of data, distributed GPUs present a…
49 Citations
Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters
- Computer Science, USENIX Annual Technical Conference
- 2017
Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication; its applicability to different DL frameworks is shown by plugging Poseidon into Caffe and TensorFlow.
GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server
- Computer Science, EuroSys
- 2016
GeePS enables a state-of-the-art single-node GPU implementation to scale well, for example to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
Involving CPUs into Multi-GPU Deep Learning
- Computer Science, ICPE
- 2018
This work proposes a novel approach to data-parallel training, called CPU-GPU data parallel (CGDP) training, that uses free CPU time on the host to speed up training on the GPUs, and presents a cost model for analyzing and comparing the performance of typical data-parallel training and CGDP training.
NUMA-Caffe
- Computer Science, ACM Trans. Archit. Code Optim.
- 2018
Experimental results demonstrate that NUMA-Caffe significantly outperforms the state-of-the-art Caffe designs in terms of both throughput and scalability.
Profiling DNN Workloads on a Volta-based DGX-1 System
- Computer Science, 2018 IEEE International Symposium on Workload Characterization (IISWC)
- 2018
This work profiles and analyzes the training of five popular DNNs using 1, 2, 4 and 8 GPUs, and shows the breakdown of the training time across the FP+BP stage and the WU stage to provide insights about the limiting factors of the training algorithm as well as to identify the bottlenecks in the multi-GPU system architecture.
Improving the performance of dataflow systems for deep neural network training
- Computer Science
- 2017
Ako, a DNN system that uses partial gradient exchange for synchronising replicas in a peer-to-peer fashion and exhibits a 25% lower convergence time than hand-tuned parameter-server deployments, is presented.
TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters
- Computer Science, IEEE Access
- 2018
This work proposes TensorLightning which integrates the widely used data pipeline of Apache Spark with powerful deep learning libraries, Caffe and TensorFlow, and redesigns the elastic averaging stochastic gradient descent algorithm with pruned and sparse form parameters.
Stanza: Layer Separation for Distributed Training in Deep Learning
- Computer Science, IEEE Transactions on Services Computing
- 2022
This work proposes layer separation in distributed training: most nodes of the cluster train only the convolutional layers, while the rest train the fully connected layers, thereby substantially reducing the data transfer volume.
Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services
- Computer Science, 2019 IEEE 37th International Conference on Computer Design (ICCD)
- 2019
Ebird, a deep learning serving system that is comprised of a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler, is proposed, which reduces the response latency of inferences and improves the throughput while guaranteeing the QoS target compared with TensorFlow Serving.
Survey of scaling platforms for Deep Neural Networks
- Computer Science, 2016 International Conference on Emerging Trends in Communication Technologies (ETCT)
- 2016
Different approaches have been proposed to scale processing on clusters of GPU servers for deep neural networks using General Purpose GPUs.
30 References
Multi-GPU Training of ConvNets
- Computer Science, ICLR
- 2014
This work isolates the impact of parallelism, while using standard supervised back-propagation and synchronous mini-batch stochastic gradient descent to investigate methods to speed convergence by parallelizing training across multiple GPUs.
Caffe: Convolutional Architecture for Fast Feature Embedding
- Computer Science, ACM Multimedia
- 2014
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Large Scale Distributed Deep Networks
- Computer Science, NIPS
- 2012
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
SparkNet: Training Deep Networks in Spark
- Computer Science, ICLR
- 2016
This work introduces SparkNet, a framework for training deep networks in Spark using a simple parallelization scheme for stochastic gradient descent that scales well with the cluster size and tolerates very high-latency communication.
Project Adam: Building an Efficient and Scalable Deep Learning Training System
- Computer Science, OSDI
- 2014
This work presents the design and implementation of a distributed system called Adam, comprised of commodity server machines, for training large deep neural network models; the system exhibits world-class performance, scaling and task accuracy on visual recognition tasks, and the results show that task accuracy improves with larger models.
Deep learning with COTS HPC systems
- Computer Science, ICML
- 2013
This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI, and shows that it can scale to networks with over 11 billion parameters using just 16 machines.
Fast Convolutional Nets With fbfft: A GPU Performance Evaluation
- Computer Science, ICLR
- 2015
We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution…
Petuum: A New Platform for Distributed Machine Learning on Big Data
- Computer Science, IEEE Transactions on Big Data
- 2015
This work proposes a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions.
Theano: Deep Learning on GPUs with Python
- Computer Science
- 2012
This paper presents Theano, a framework in the Python programming language for defining, optimizing and evaluating expressions involving high-level operations on tensors, and adds automatic symbolic differentiation, GPU support, and faster expression evaluation.
Building high-level features using large scale unsupervised learning
- Computer Science, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.