Project Adam: Building an Efficient and Scalable Deep Learning Training System

Abstract

Large deep neural network models have recently demonstrated state-of-the-art accuracy on hard visual recognition tasks. Unfortunately, such models are extremely time-consuming to train and require large amounts of compute cycles. We describe the design and implementation of Adam, a distributed system built from commodity server machines that trains such models and exhibits world-class performance, scaling, and task accuracy on visual recognition tasks. Adam achieves high efficiency and scalability through whole-system co-design that optimizes and balances workload computation and communication. We exploit asynchrony throughout the system to improve performance and show that it additionally improves the accuracy of trained models. Adam is significantly more efficient and scalable than was previously thought possible: on the ImageNet 22,000-category image classification task, it used 30x fewer machines than the system that previously held the record for this benchmark to train a large 2-billion-connection model to 2x higher accuracy in comparable time. We also show that task accuracy improves with larger models. Our results provide compelling evidence that a distributed-systems-driven approach to deep learning using current training algorithms is worth pursuing.
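The abstract's claim that asynchrony improves both performance and model accuracy refers to workers applying gradient updates to shared parameters without synchronization, so some updates are computed from slightly stale weights. A minimal sketch of this idea (lock-free, Hogwild-style updates on a toy linear model, with illustrative names not taken from the paper) looks like:

```python
import threading
import random

# Shared model parameters, updated by all workers without locks.
# In Adam these would live on parameter-server machines; here they
# are just a dict shared across threads for illustration.
params = {"w": 0.0, "b": 0.0}

def make_data(n=200, true_w=3.0, true_b=-1.0, seed=0):
    """Noiseless synthetic data for y = true_w * x + true_b."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [true_w * x + true_b for x in xs]
    return xs, ys

def worker(xs, ys, steps=2000, lr=0.05, seed=1):
    """One asynchronous worker: SGD steps against the shared params.

    Each read-modify-write may interleave with other workers, so a
    gradient can be computed from stale parameter values -- the
    asynchrony the system exploits for throughput.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(xs))
        x, y = xs[i], ys[i]
        err = params["w"] * x + params["b"] - y
        params["w"] -= lr * err * x   # unsynchronized update
        params["b"] -= lr * err       # unsynchronized update

xs, ys = make_data()
threads = [
    threading.Thread(target=worker, args=(xs, ys), kwargs={"seed": s})
    for s in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Despite stale gradients, the shared model still converges.
loss = sum((params["w"] * x + params["b"] - y) ** 2
           for x, y in zip(xs, ys)) / len(xs)
```

This toy version omits everything that makes the real system hard (model partitioning, parameter-server communication, NUMA-aware computation), but it shows the core trade: no coordination overhead, at the cost of occasionally stale updates.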


Cite this paper

@inproceedings{Chilimbi2014ProjectAB,
  title     = {Project Adam: Building an Efficient and Scalable Deep Learning Training System},
  author    = {Trishul M. Chilimbi and Yutaka Suzue and Johnson Apacible and Karthik Kalyanaraman},
  booktitle = {OSDI},
  year      = {2014}
}