Corpus ID: 245999492

Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Liangchen Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy
Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, however, training must go distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger… 


Fast and Robust Distributed Learning in High Dimension

It is proved that MULTI-BULYAN ensures a strong form of Byzantine resilience while incurring only an $\frac{m}{n}$ slowdown compared to averaging, the fastest (but non-Byzantine-resilient) rule for distributed machine learning.
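As an illustrative sketch (not the MULTI-BULYAN algorithm itself), a coordinate-wise trimmed mean shows the kind of robust aggregation such rules build on: with up to f Byzantine workers, dropping the f largest and f smallest values per coordinate bounds the influence of corrupted gradients. All names here are hypothetical.

```python
def trimmed_mean(gradients, f):
    """Aggregate worker gradients, trimming f extremes on each side.

    gradients: list of gradient vectors (lists of floats), one per worker.
    f: assumed upper bound on the number of Byzantine workers.
    """
    n = len(gradients)
    assert n > 2 * f, "need more honest workers than trimmed values"
    dim = len(gradients[0])
    result = []
    for j in range(dim):
        column = sorted(g[j] for g in gradients)
        kept = column[f:n - f]          # drop f smallest and f largest
        result.append(sum(kept) / len(kept))
    return result
```

A single outlier-injecting worker cannot move the aggregate outside the range of the honest values, which plain averaging cannot guarantee.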

Training Deep Nets with Sublinear Memory Cost

This work designs an algorithm that costs O(√n) memory to train an n-layer network, with only the computational cost of an extra forward pass per mini-batch, and shows that it is possible to trade computation for memory, giving a more memory-efficient training algorithm at a small extra computational cost.
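A minimal sketch of the checkpointing idea behind this result: store activations only every k ≈ √n layers, and recompute the intermediate ones inside each segment during the backward pass. The function names are hypothetical, and a real implementation would also propagate gradients.

```python
import math

def forward_with_checkpoints(x, layers):
    """Run a forward pass, keeping only ~sqrt(n) activation checkpoints."""
    k = max(1, math.isqrt(len(layers)))     # segment length ~ sqrt(n)
    checkpoints = []                        # O(n / k) stored activations
    for i, layer in enumerate(layers):
        if i % k == 0:
            checkpoints.append((i, x))      # keep this activation
        x = layer(x)
    return x, checkpoints

def recompute_segment(start_index, start_x, layers, end_index):
    """Re-run one segment to rebuild activations needed for backprop,
    trading an extra forward pass for lower peak memory."""
    acts = []
    x = start_x
    for layer in layers[start_index:end_index]:
        acts.append(x)
        x = layer(x)
    return acts
```

During backprop, each segment's activations are rebuilt from its checkpoint just before they are needed, so at most one segment's activations are live at a time.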

Benchmarking State-of-the-Art Deep Learning Software Tools

This paper presents an attempt to benchmark several state-of-the-art GPU-accelerated deep learning software tools, including Caffe, CNTK, TensorFlow, and Torch, and focuses on evaluating the running time performance of these tools with three popular types of neural networks on two representative CPU platforms and three representative GPU platforms.

Communication Efficient Distributed Machine Learning with the Parameter Server

An in-depth analysis of two large scale machine learning problems, ranging from ℓ1-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636 TB of real data with hundreds of billions of samples and dimensions, is presented.
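The interface these systems expose is small: workers pull the current value of a parameter by key and push gradients back. A toy single-process sketch (a plain dictionary standing in for the sharded, networked servers of the real system; all names hypothetical):

```python
class ToyParameterServer:
    """Toy stand-in for a distributed key-value parameter server."""

    def __init__(self, lr=0.1):
        self.params = {}
        self.lr = lr

    def pull(self, key):
        # Return the current parameter value, initializing it lazily.
        return self.params.setdefault(key, 0.0)

    def push(self, key, grad):
        # Apply an SGD step. Real servers aggregate pushes from many
        # workers, often asynchronously and with relaxed consistency.
        self.params[key] = self.params.get(key, 0.0) - self.lr * grad
```

The real systems shard keys across many server nodes and batch pushes/pulls to amortize network cost, but the worker-facing contract is essentially this pull/push pair.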

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling training of visual recognition models on internet-scale data with high efficiency.
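The paper's recipe pairs a linear learning-rate scaling rule (learning rate proportional to minibatch size) with a gradual warmup. A minimal sketch; the base rate of 0.1 at batch size 256 matches the ResNet setup the paper describes, while the warmup length in steps is an illustrative assumption:

```python
def learning_rate(step, batch_size, base_lr=0.1, base_batch=256,
                  warmup_steps=500):
    """Linear scaling rule with gradual warmup (sketch)."""
    target = base_lr * batch_size / base_batch   # linear scaling rule
    if step < warmup_steps:
        # Ramp linearly from a small value up to the scaled target,
        # avoiding instability early in large-batch training.
        return target * (step + 1) / warmup_steps
    return target
```

For example, at batch size 8192 the post-warmup rate is 0.1 × 8192/256 = 3.2.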

Scaling Distributed Machine Learning with the Parameter Server

Views on the new challenges identified are shared, and application scenarios such as micro-blog data analysis and data processing for building next-generation search engines are covered.

An architecture for parallel topic models

This paper describes a high performance sampling architecture for inference of latent topic models on a cluster of workstations and shows that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.

IncBricks: Toward In-Network Computation with an In-Network Cache

IncBricks is a hardware-software co-designed system that supports caching in the network using a programmable network middlebox; it lowers request latency by over 30% and doubles throughput for 1024-byte values in a common cluster configuration.

DimmWitted: A Study of Main-Memory Statistical Analytics

This first study of the tradeoff space of access methods and replication for statistical analytics, using first-order methods executed in the main memory of a Non-Uniform Memory Access (NUMA) machine, discovers tradeoffs between hardware and statistical efficiency.

Be Fast, Cheap and in Control with SwitchKV

SwitchKV is a new key-value store system design that combines high-performance cache nodes with resource-constrained backend nodes to provide load balancing in the face of unpredictable workloads.