• Corpus ID: 245999492

# Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

@inproceedings{Luo2018ParameterBH,
title={Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training},
author={Liangchen Luo and Jacob Nelson and Luis Ceze and Amar Phanishayee and Arvind Krishnamurthy},
year={2018}
}
• Published 30 January 2018
• Computer Science
Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger…
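The abstract centers on the parameter-server architecture for DDNN training. As a generic illustration of that pattern (a minimal sketch only; the class and function names below are hypothetical and do not reflect Parameter Box's actual design), workers pull the shared model from a server, compute gradients locally, and push them back for the server to apply:

```python
# Minimal sketch of the classic parameter-server pattern for distributed
# DNN training. Generic illustration; not the paper's Parameter Box system.

class ParameterServer:
    def __init__(self, num_params, lr=0.5):
        self.weights = [0.0] * num_params  # shared model state
        self.lr = lr                       # step size for SGD updates

    def push(self, gradients):
        # Apply a worker's gradient contribution with plain SGD.
        for i, g in enumerate(gradients):
            self.weights[i] -= self.lr * g

    def pull(self):
        # Workers fetch a snapshot of the latest model before computing.
        return list(self.weights)


def worker_step(ps, local_gradient):
    # One asynchronous training step: pull the model, (compute a gradient
    # on local data -- elided here), then push the gradient to the server.
    model = ps.pull()
    ps.push(local_gradient)
    return model


ps = ParameterServer(num_params=3)
worker_step(ps, [1.0, 2.0, 3.0])   # worker 0 pushes its gradient
worker_step(ps, [1.0, 0.0, -1.0])  # worker 1 pushes its gradient
print(ps.weights)  # -> [-1.0, -1.0, -1.0]
```

In a real deployment the push/pull calls cross the network, which is why the abstract notes that DDNN training stops being purely compute-bound at scale: the server's aggregation bandwidth becomes the bottleneck.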