• Corpus ID: 15122323

Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines

H. Zhang, Zhiting Hu, Jinliang Wei, Pengtao Xie, Gunhee Kim, Qirong Ho, Eric P. Xing

Deep learning (DL) has achieved notable successes in many machine learning tasks. A number of frameworks have been developed to expedite the process of designing and training deep neural networks (DNNs), such as Caffe, Torch and Theano. Currently they can harness multiple GPUs on a single machine, but are unable to use GPUs that are distributed across multiple machines; as even average-sized DNNs can take days to train on a single GPU with 100s of GBs to TBs of data, distributed GPUs present a… 

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Poseidon exploits the layered model structures in DL programs to overlap communication and computation, reducing bursty network communication; its applicability to different DL frameworks is shown by plugging it into Caffe and TensorFlow.

GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server

GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.

Involving CPUs into Multi-GPU Deep Learning

This work proposes a novel approach to data-parallel training, called CPU-GPU data parallel (CGDP) training, that utilizes free CPU time on the host to speed up training on the GPUs, and presents a cost model for analyzing and comparing the performance of typical data-parallel training and CGDP training.


Experimental results demonstrate that NUMA-Caffe significantly outperforms the state-of-the-art Caffe designs in terms of both throughput and scalability.

Profiling DNN Workloads on a Volta-based DGX-1 System

This work profiles and analyzes the training of five popular DNNs using 1, 2, 4 and 8 GPUs, and shows the breakdown of the training time across the FP+BP stage and the WU stage to provide insights about the limiting factors of the training algorithm as well as to identify the bottlenecks in the multi-GPU system architecture.

Improving the performance of dataflow systems for deep neural network training

This work presents Ako, a DNN system that uses partial gradient exchange to synchronise replicas in a peer-to-peer fashion and exhibits a 25% lower convergence time than hand-tuned parameter-server deployments.

TensorLightning: A Traffic-Efficient Distributed Deep Learning on Commodity Spark Clusters

This work proposes TensorLightning which integrates the widely used data pipeline of Apache Spark with powerful deep learning libraries, Caffe and TensorFlow, and redesigns the elastic averaging stochastic gradient descent algorithm with pruned and sparse form parameters.

Stanza: Layer Separation for Distributed Training in Deep Learning

This work proposes layer separation in distributed training: most nodes of the cluster train only the convolutional layers, while the rest train the fully connected layers, thereby substantially reducing the data transfer volume.

Ebird: Elastic Batch for Improving Responsiveness and Throughput of Deep Learning Services

This work proposes Ebird, a deep learning serving system comprising a GPU-resident memory pool, a multi-granularity inference engine, and an elastic batch scheduler; compared with TensorFlow Serving, it reduces inference response latency and improves throughput while guaranteeing the QoS target.

Survey of scaling platforms for Deep Neural Networks

Different approaches have been proposed to scale processing on clusters of GPU servers for deep neural networks using General Purpose GPUs.

Multi-GPU Training of ConvNets

This work isolates the impact of parallelism, while using standard supervised back-propagation and synchronous mini-batch stochastic gradient descent to investigate methods to speed convergence by parallelizing training across multiple GPUs.

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

SparkNet: Training Deep Networks in Spark

This work introduces SparkNet, a framework for training deep networks in Spark using a simple parallelization scheme for stochastic gradient descent that scales well with the cluster size and tolerates very high-latency communication.

Project Adam: Building an Efficient and Scalable Deep Learning Training System

This work describes the design and implementation of Adam, a distributed system composed of commodity server machines for training large deep neural network models; it exhibits world-class performance, scaling and task accuracy on visual recognition tasks, and shows that task accuracy improves with larger models.

Deep learning with COTS HPC systems

This paper presents technical details and results from the authors' own system based on Commodity Off-The-Shelf High Performance Computing (COTS HPC) technology: a cluster of GPU servers with Infiniband interconnects and MPI. It shows that the system can scale to networks with over 11 billion parameters using just 16 machines.

Fast Convolutional Nets With fbfft: A GPU Performance Evaluation

We examine the performance profile of Convolutional Neural Network training on the current generation of NVIDIA Graphics Processing Units. We introduce two new Fast Fourier Transform convolution

Petuum: A New Platform for Distributed Machine Learning on Big Data

This work proposes a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions.

Theano: Deep Learning on GPUs with Python

This paper presents Theano, a framework in the Python programming language for defining, optimizing and evaluating expressions involving high-level operations on tensors, and adds automatic symbolic differentiation, GPU support, and faster expression evaluation.

Building high-level features using large scale unsupervised learning

  • Quoc V. Le, M. Ranzato, A. Ng
  • Computer Science
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.