• Corpus ID: 2185117

Project Adam: Building an Efficient and Scalable Deep Learning Training System

@inproceedings{Chilimbi2014ProjectAB,
  title={Project Adam: Building an Efficient and Scalable Deep Learning Training System},
  author={Trishul M. Chilimbi and Yutaka Suzue and Johnson Apacible and Karthik Kalyanaraman},
  booktitle={OSDI},
  year={2014}
}
Large deep neural network models have recently demonstrated state-of-the-art accuracy on hard visual recognition tasks. Unfortunately, such models are extremely time-consuming to train and require a large amount of compute cycles. We describe the design and implementation of a distributed system called Adam, composed of commodity server machines, that trains such models and exhibits world-class performance, scaling, and task accuracy on visual recognition tasks. Adam achieves high efficiency and… 
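The abstract sketches a system that trains large models across commodity servers; below is a minimal single-process Python sketch of the asynchronous parameter-server pattern such systems build on. The ParameterServer class, the least-squares objective, and all hyperparameters here are illustrative assumptions, not details from the paper.

  import numpy as np

  class ParameterServer:
      """Toy in-process stand-in for a parameter-server shard (illustrative only)."""
      def __init__(self, dim, lr=0.05):
          self.w = np.zeros(dim)
          self.lr = lr

      def pull(self):
          return self.w.copy()

      def push_gradient(self, grad):
          # Asynchronous-style update: apply whichever gradient arrives next.
          self.w -= self.lr * grad

  def worker_step(ps, x, y):
      """One data-parallel worker step on a toy least-squares objective."""
      w = ps.pull()
      grad = 2 * x * (np.dot(w, x) - y)   # gradient of (w.x - y)^2
      ps.push_gradient(grad)

  rng = np.random.default_rng(0)
  true_w = np.array([1.0, -2.0, 0.5])
  ps = ParameterServer(dim=3)
  for _ in range(500):
      x = rng.normal(size=3)
      worker_step(ps, x, float(np.dot(true_w, x)))
  print(ps.pull())   # should approach true_w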
Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train
TLDR
The challenges and novel solutions needed to train ResNet-50 in this large-scale environment are described, and the novel Collapsed Ensemble (CE) technique is introduced, which allows for a 77.5% top-1 accuracy, similar to that of a ResNet-152, while training an unmodified ResNet-50 topology for the same fixed training budget.
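A common recipe for the large-minibatch regime this entry targets, though not necessarily the exact schedule the cited paper uses, is to scale the base learning rate linearly with the global batch size and warm it up over the first iterations; a small Python sketch with illustrative parameter names:

  def scaled_lr(base_lr, base_batch, global_batch, step, warmup_steps):
      """Linear learning-rate scaling with warmup, a common large-minibatch
      heuristic (illustrative; not necessarily the cited paper's schedule)."""
      target = base_lr * global_batch / base_batch
      if step < warmup_steps:
          return target * (step + 1) / warmup_steps   # ramp up toward the target
      return target

  # e.g. a base LR of 0.1 tuned for batch 256, scaled for a global batch of 8192
  print(scaled_lr(0.1, 256, 8192, step=0, warmup_steps=500))      # small warmup LR
  print(scaled_lr(0.1, 256, 8192, step=1000, warmup_steps=500))   # full scaled LR, 3.2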
Parallax: Automatic Data-Parallel Training of Deep Neural Networks
TLDR
Parallax is introduced, a tool for automatic parallelization of deep learning training in distributed environments that not only handles the subtle correctness issues but also leverages various optimizations to minimize the communication overhead caused by scaling out.
PowerAI DDL
TLDR
A software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs using a multi-ring communication pattern that provides a good tradeoff between latency and bandwidth and adapts to a variety of system configurations.
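The multi-ring pattern above trades latency against bandwidth; as a point of reference, a single-ring allreduce can be simulated in a few lines of Python. This is an illustrative simulation of the standard ring algorithm, not the cited multi-ring implementation.

  import numpy as np

  def ring_allreduce(vectors):
      """Simulate a single-ring allreduce: every 'rank' ends with the elementwise
      sum of all input vectors (reduce-scatter followed by allgather)."""
      p = len(vectors)
      chunks = [list(np.array_split(np.asarray(v, dtype=float), p)) for v in vectors]
      # Reduce-scatter: after p-1 steps, rank i holds the full sum of chunk (i+1) % p.
      for s in range(p - 1):
          for i in range(p):
              c = (i - s) % p
              chunks[(i + 1) % p][c] = chunks[(i + 1) % p][c] + chunks[i][c]
      # Allgather: circulate the fully reduced chunks once around the ring.
      for s in range(p - 1):
          for i in range(p):
              c = (i + 1 - s) % p
              chunks[(i + 1) % p][c] = chunks[i][c]
      return [np.concatenate(chunks[i]) for i in range(p)]

  grads = [np.arange(8, dtype=float) + r for r in range(4)]
  assert all(np.allclose(g, sum(grads)) for g in ring_allreduce(grads))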
Benchmarking and Analyzing Deep Neural Network Training
TLDR
This work proposes a new benchmark suite for DNN training, called TBD, and presents a new toolchain for performance analysis for these models that combines the targeted usage of existing performance analysis tools, careful selection of performance metrics, and methodologies to analyze the results.
Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems
TLDR
Performance models that quantify the impact of partitioning and provisioning decisions on overall distributed system performance and scalability and a scalability optimizer that efficiently determines the optimal system configuration that minimizes DNN training time are developed.
Accelerating Deep Neural Network Training for Action Recognition on a Cluster of GPUs
TLDR
This work proposes algorithms and techniques to accelerate training of deep neural networks for action recognition on a cluster of GPUs, and achieves super-linear speedups on 16 GPUs while improving validation accuracy.
TensorFlow: A system for large-scale machine learning
TLDR
The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
Improving the performance of dataflow systems for deep neural network training
TLDR
Ako, a DNN system that uses partial gradient exchange for synchronising replicas in a peer-to-peer fashion and exhibits a 25% lower convergence time than hand-tuned parameter-server deployments, is presented.
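Partial gradient exchange, as summarized above, has each replica communicate only part of its gradient per round; the heavily simplified Python sketch below broadcasts one rotating partition per worker per round, which conveys the bandwidth-versus-staleness idea but is not Ako's exact protocol.

  import numpy as np

  def partial_exchange_round(grads, r):
      """One round of simplified partial gradient exchange: worker i shares only
      partition (i + r) % p of its gradient, which every peer adds in."""
      p = len(grads)
      parts = [np.array_split(np.asarray(g, dtype=float), p) for g in grads]
      for i in range(p):                       # sender
          k = (i + r) % p                      # the one partition i shares this round
          for j in range(p):
              if j != i:
                  parts[j][k] = parts[j][k] + parts[i][k]
      return [np.concatenate(pt) for pt in parts]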
Using Supercomputer to Speed up Neural Network Training
  • Yue Yu, Jinrong Jiang, X. Chi
  • Computer Science
    2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)
  • 2016
TLDR
This paper develops a framework based on Caffe, called Caffe-HPC, that can utilize computing clusters with multiple GPUs to train large models, making it possible to train larger networks on larger training sets in a reasonable amount of time.
Channel and filter parallelism for large-scale CNN training
TLDR
This work introduces three algorithms that partition channel or filter data to exploit parallelism beyond the sample dimension, and partition the parameters of convolutional layers, replacing global allreduces with segmented allreduces among disjoint processor sets.
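Filter (output-channel) parallelism is easiest to see for a 1x1 convolution, which is just a channel-mixing matrix multiply. The Python sketch below partitions the filters across hypothetical workers and checks the concatenated result against the unpartitioned computation; it illustrates the partitioning idea only, not the cited implementation or its segmented allreduces.

  import numpy as np

  def filter_parallel_1x1_conv(x, W, num_workers):
      """Each 'worker' applies only its disjoint slice of the filters; outputs are
      concatenated along the channel axis. x: (C_in, H, W_img), W: (C_out, C_in)."""
      parts = np.array_split(W, num_workers, axis=0)
      outputs = [np.tensordot(Wp, x, axes=([1], [0])) for Wp in parts]
      return np.concatenate(outputs, axis=0)            # (C_out, H, W_img)

  x = np.random.randn(8, 5, 5)
  W = np.random.randn(16, 8)
  assert np.allclose(filter_parallel_1x1_conv(x, W, 4),
                     np.tensordot(W, x, axes=([1], [0])))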
...

References

Showing 1-10 of 32 references
Large Scale Distributed Deep Networks
TLDR
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
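Downpour SGD, as summarized above, lets asynchronous replicas fetch parameters and push accumulated gradients only every few steps. The Python sketch below mimics that cadence for a single replica over toy parameter-server shards; the objective, sharding, and step sizes are illustrative assumptions, not DistBelief's actual configuration.

  import numpy as np

  class Shard:
      """Toy parameter-server shard holding a slice of the model (illustrative)."""
      def __init__(self, w, lr=0.02):
          self.w, self.lr = w.astype(float), lr
      def apply(self, g):
          self.w -= self.lr * g

  def downpour_style_replica(shards, data, local_lr=0.02, n_fetch=3, n_push=3):
      """Parameters are fetched and accumulated gradients pushed only every few
      steps, so server updates are applied with some staleness."""
      w = np.concatenate([s.w for s in shards])
      acc = np.zeros_like(w)
      bounds = np.cumsum([len(s.w) for s in shards])[:-1]
      for t, (x, y) in enumerate(data):
          g = 2 * x * (np.dot(w, x) - y)        # gradient of (w.x - y)^2
          acc += g
          w -= local_lr * g                     # local step on (possibly stale) params
          if (t + 1) % n_push == 0:
              for s, gs in zip(shards, np.split(acc, bounds)):
                  s.apply(gs)
              acc[:] = 0.0
          if (t + 1) % n_fetch == 0:
              w = np.concatenate([s.w for s in shards])

  rng = np.random.default_rng(1)
  shards = [Shard(np.zeros(2)), Shard(np.zeros(2))]
  true_w = np.array([1.0, -2.0, 0.5, 3.0])
  data = [(x, float(np.dot(true_w, x))) for x in rng.normal(size=(500, 4))]
  downpour_style_replica(shards, data)
  print(np.concatenate([s.w for s in shards]))   # should be close to true_w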
Scaling learning algorithms towards AI
TLDR
It is argued that deep architectures have the potential to generalize in non-local ways, i.e., beyond immediate neighbors, and that this is crucial in order to make progress on the kind of complex tasks required for artificial intelligence.
Visualizing and Understanding Convolutional Networks
TLDR
A novel visualization technique is introduced that gives insight into the function of intermediate feature layers and the operation of the classifier in large Convolutional Network models, used in a diagnostic role to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.
Large-scale deep unsupervised learning using graphics processors
TLDR
It is argued that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods.
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
TLDR
This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
Maxout Networks
TLDR
A simple new model called maxout is defined, designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique.
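A maxout unit outputs the maximum over k affine pieces, h_j(x) = max_i (x . W_ij + b_ij); a minimal Python sketch with illustrative shapes:

  import numpy as np

  def maxout(x, W, b):
      """Maxout activation: x has shape (d,), W has shape (k, m, d), b has shape
      (k, m); returns m unit outputs, each the max over k linear pieces."""
      return np.max(W @ x + b, axis=0)

  rng = np.random.default_rng(0)
  x = rng.normal(size=4)
  W, b = rng.normal(size=(3, 2, 4)), rng.normal(size=(3, 2))
  print(maxout(x, W, b))   # two maxout units, each the max of three linear pieces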
A dynamically configurable coprocessor for convolutional neural networks
TLDR
This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
Best practices for convolutional neural networks applied to visual document analysis
TLDR
A set of concrete best practices that document analysis researchers can use to get good results with neural networks, including a simple "do-it-yourself" implementation of convolution with a flexible architecture suitable for many visual document problems.
Distributed GraphLab: A Framework for Machine Learning in the Cloud
TLDR
This paper develops graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency, and introduces fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm.
ImageNet: A large-scale hierarchical image database
TLDR
A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
...