• Corpus ID: 5707386

# TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

@article{Abadi2016TensorFlowLM,
title={TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems},
author={Mart{\'i}n Abadi and Ashish Agarwal and Paul Barham and Eugene Brevdo and Z. Chen and Craig Citro and Gregory S. Corrado and Andy Davis and Jeffrey Dean and Matthieu Devin and Sanjay Ghemawat and Ian J. Goodfellow and Andrew Harp and Geoffrey Irving and Michael Isard and Yangqing Jia and Rafal J{\'o}zefowicz and Lukasz Kaiser and Manjunath Kudlur and Josh Levenberg and Dandelion Man{\'e} and Rajat Monga and Sherry Moore and Derek Gordon Murray and Christopher Olah and Mike Schuster and Jonathon Shlens and Benoit Steiner and Ilya Sutskever and Kunal Talwar and Paul A. Tucker and Vincent Vanhoucke and Vijay Vasudevan and Fernanda B. Vi{\'e}gas and Oriol Vinyals and Pete Warden and Martin Wattenberg and Martin Wicke and Yuan Yu and Xiaoqiang Zheng},
journal={ArXiv},
year={2016},
volume={abs/1603.04467}
}
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of…
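TensorFlow represents a computation as a dataflow graph whose nodes are operations and whose edges carry tensors; an implementation then executes that graph on the available devices. The toy sketch below shows only the core idea, evaluating a small graph in dependency order, using plain Python and scalar values. `Node`, `placeholder`, and `evaluate` are hypothetical names chosen for illustration and are not TensorFlow's API. The real system also handles multi-dimensional tensors, device placement, and distributed execution.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One operation in a toy dataflow graph (hypothetical; not TensorFlow's API)."""
    name: str
    op: Callable[..., float]
    inputs: List["Node"] = field(default_factory=list)


def placeholder(name: str) -> Node:
    """A node whose value must be supplied at evaluation time."""
    def missing() -> float:
        raise KeyError(f"no value fed for {name!r}")
    return Node(name, missing)


def evaluate(target: Node, feeds: Dict[str, float]) -> float:
    """Compute `target`, evaluating each input before the op that consumes it."""
    cache: Dict[str, float] = {}  # each node runs at most once per evaluation

    def run(node: Node) -> float:
        if node.name not in cache:
            if node.name in feeds:
                cache[node.name] = feeds[node.name]
            else:
                cache[node.name] = node.op(*(run(i) for i in node.inputs))
        return cache[node.name]

    return run(target)


# y = relu(w * x + b), with w and b as constants and x fed at run time.
x = placeholder("x")
w = Node("w", lambda: 2.0)
b = Node("b", lambda: -1.0)
mul = Node("mul", lambda u, v: u * v, [w, x])
add = Node("add", lambda u, v: u + v, [mul, b])
y = Node("relu", lambda u: max(0.0, u), [add])

print(evaluate(y, {"x": 3.0}))  # 5.0
print(evaluate(y, {"x": 0.0}))  # 0.0
```

Building the graph is separate from running it, so the same graph can be evaluated repeatedly with different fed values.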
## Citations (8,938)
TensorFlow: A system for large-scale machine learning
The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.
Operator Vectorization Library – A TensorFlow Plugin
TensorFlow is an interface for implementing machine learning applications that can be accelerated by using Graphics Processing Units (GPUs). It is rapidly becoming a standard tool in this space.
Improving the Performance of Distributed TensorFlow with RDMA
This work presents an RDMA-capable design of TensorFlow that scales well as the training scale grows and achieves nearly a 6× performance improvement over the original gRPC-based distributed TensorFlow.
TensorX: Extensible API for Neural Network Model Design and Deployment
• Computer Science
ArXiv
• 2020
TensorX is a Python library for prototyping, design, and deployment of complex neural network models in TensorFlow, aiming to make available high-level components like neural network layers that are, in effect, stateful functions, easy to compose and reuse.
SingleCaffe: An Efficient Framework for Deep Learning on a Single Node
SingleCaffe is presented: a deep learning (DL) framework that makes full use of hardware with high computing power to improve the computational efficiency of the training process. Experimental results show that SingleCaffe improves training efficiency.
A Tour of TensorFlow
This paper reviews TensorFlow and puts it in the context of modern deep learning concepts and software. It covers TensorFlow's basic computational paradigms and distributed execution model, its programming interface, and its accompanying visualization toolkits, and comments on observed use cases of TensorFlow in academia and industry.
DLVM: A Modern Compiler Infrastructure for Deep Learning
Many current approaches to deep learning make use of high-level toolkits such as TensorFlow, Torch, or Caffe. Toolkits such as Caffe have a layer-based programming framework with hard-coded gradients…
Increasing Portable Machine Learning Performance by Application of Rewrite Rules on Google Tensorflow Data Flow Graphs
Machine learning is an important field that is usually limited by execution performance. The approach commonly used to solve this problem is to make use of parallelism provided by hardware such as…
Scalability Study of Deep Learning Algorithms in High Performance Computer Infrastructures
This project shows how the training of a state-of-the-art neural network for computer vision can be parallelized with the TensorFlow framework on a distributed GPU cluster, the Minotauro GPU cluster at the Barcelona Supercomputing Center.
In-Database Machine Learning: Gradient Descent and Tensor Algebra for Main Memory Database Systems
• Computer Science
BTW
• 2019
This work aims to incorporate gradient descent and tensor data types into database systems, allowing them to handle a wider range of computational tasks, and implements tensor algebra and stochastic gradient descent using lambda expressions for loss functions as a pipelined operator in a main memory database system.
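The in-database work above expresses loss functions as lambda expressions and runs stochastic gradient descent over stored tuples. As a rough, stand-alone illustration of that interface, the sketch below takes a per-row loss written as a Python lambda and fits a line to a small hypothetical table. It estimates gradients by central finite differences, a simple stand-in chosen here for brevity. The cited paper's own differentiation and its pipelining inside a database engine are not reproduced, and `sgd`, `Row`, and `Loss` are illustrative names.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[float, float]  # (x, y) tuple from a hypothetical table
Loss = Callable[[List[float], Row], float]


def sgd(rows: Iterable[Row], loss: Loss, params: List[float],
        lr: float = 0.05, epochs: int = 200, eps: float = 1e-6) -> List[float]:
    """Plain per-row SGD; gradients via central finite differences (a stand-in)."""
    data = list(rows)  # materialize so the table can be scanned once per epoch
    params = list(params)
    for _ in range(epochs):
        for row in data:
            grad = []
            for i in range(len(params)):
                up, down = params.copy(), params.copy()
                up[i] += eps
                down[i] -= eps
                grad.append((loss(up, row) - loss(down, row)) / (2 * eps))
            params = [p - lr * g for p, g in zip(params, grad)]
    return params


# Per-row squared error for y ≈ a*x + b, supplied as a lambda.
squared_error: Loss = lambda p, r: (p[0] * r[0] + p[1] - r[1]) ** 2

table = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
a, b = sgd(table, squared_error, [0.0, 0.0])
print(round(a, 3), round(b, 3))  # should approach 2 and 1
```

Because the optimizer only ever calls `loss(params, row)`, swapping in a different model or loss means passing a different lambda, which is the flexibility the lambda-based interface is meant to provide.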

## References

Showing 1-10 of 67 references
Project Adam: Building an Efficient and Scalable Deep Learning Training System
• Computer Science
OSDI
• 2014
The design and implementation of Adam, a distributed system built from commodity server machines to train large deep neural network models, are described. Adam exhibits world-class performance, scaling, and task accuracy on visual recognition tasks, and the results show that task accuracy improves with larger models.
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
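Downpour SGD runs asynchronous SGD in which many model replicas exchange parameters and gradients with a sharded parameter server. The sketch below is a single-process, round-robin simulation of that pattern on a toy linear regression, meant only to show replicas computing gradients from stale parameter copies. It is not the paper's implementation: the parameter server here is not sharded, and `ParameterServer`, `async_sgd_sim`, and the `fetch_every` knob are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y)


class ParameterServer:
    """Central copy of the model parameters [w, b] for y = w*x + b."""

    def __init__(self, lr: float) -> None:
        self.params = [0.0, 0.0]
        self.lr = lr

    def fetch(self) -> List[float]:
        return list(self.params)

    def push(self, grad: Sequence[float]) -> None:
        # Apply each replica's gradient as soon as it arrives (no locking or averaging).
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]


def mse_gradient(params: Sequence[float], batch: Sequence[Sample]) -> List[float]:
    """Gradient of mean squared error for y ≈ w*x + b on one mini-batch."""
    w, b = params
    gw = gb = 0.0
    for x, y in batch:
        err = w * x + b - y
        gw += 2 * err * x
        gb += 2 * err
    return [gw / len(batch), gb / len(batch)]


def async_sgd_sim(shards, steps=500, batch_size=4, lr=0.02, fetch_every=2, seed=0):
    """Round-robin simulation of replicas training against one parameter server."""
    rng = random.Random(seed)
    server = ParameterServer(lr)
    local = [server.fetch() for _ in shards]  # each replica's (possibly stale) copy
    for step in range(steps):
        for r, shard in enumerate(shards):
            if step % fetch_every == 0:
                local[r] = server.fetch()  # refresh only every `fetch_every` steps
            batch = rng.sample(shard, batch_size)
            # The gradient uses the replica's local copy, which may lag the server.
            server.push(mse_gradient(local[r], batch))
    return server.params


# Noise-free data from y = 3x - 2, split across three replicas.
data = [(x / 10, 3 * (x / 10) - 2) for x in range(-20, 21)]
shards = [data[i::3] for i in range(3)]
w, b = async_sgd_sim(shards)
print(round(w, 3), round(b, 3))  # should be close to 3 and -2
```

Raising `fetch_every` or the number of replicas increases staleness. With a small enough learning rate the updates still converge on this toy problem.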
Caffe: Convolutional Architecture for Fast Feature Embedding
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
An introduction to computational networks and the computational network toolkit (invited talk)
The Computational Network Toolkit (CNTK), an implementation of computational networks (CNs) that supports both GPU and CPU, is introduced. Its architecture and key components, its command-line options, and its network definition and model editing language are described.
Building high-level features using large scale unsupervised learning
• Quoc V. Le, +5 authors A. Ng
• Computer Science, Mathematics
2013 IEEE International Conference on Acoustics, Speech and Signal Processing
• 2013
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.
Multilingual acoustic models using distributed deep neural networks
• G. Heigold, +4 authors J. Dean
• Computer Science
2013 IEEE International Conference on Acoustics, Speech and Signal Processing
• 2013
Experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total show average relative gains over the monolingual baselines, but the additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks.
cuDNN: Efficient Primitives for Deep Learning
cuDNN is a library similar in intent to BLAS, with optimized routines for deep learning workloads. It contains routines for GPUs and, like the BLAS library, could be implemented for other platforms.
On rectified linear units for speech processing
This work shows that it can improve generalization and make training of deep networks faster and simpler by substituting the logistic units with rectified linear units.
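One commonly cited intuition for why rectified linear units can make deep networks easier to train than logistic units concerns the gradient each unit passes back. This is a general observation, not a summary of the cited paper's experiments. The logistic derivative never exceeds 0.25 and vanishes for large |x|, whereas the ReLU derivative is exactly 1 for any positive input. The sketch multiplies per-unit derivatives along a chain of 10 units, ignoring weights, to make the contrast concrete.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logistic_grad(x: float) -> float:
    s = logistic(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0  # 0 is used at x = 0 by convention


# Gradient factor accumulated through a chain of 10 units, each at pre-activation 1.0
# (weights ignored, so this isolates the effect of the activation function).
depth, x = 10, 1.0
print("logistic:", logistic_grad(x) ** depth)
print("relu:    ", relu_grad(x) ** depth)
```

The logistic chain shrinks the signal by several orders of magnitude, while the ReLU chain passes it through unchanged for positive inputs.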
Sequence to Sequence Learning with Neural Networks
• Computer Science
NIPS
• 2014
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
• Computer Science
ICML
• 2015
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
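Batch Normalization normalizes each feature over a mini-batch and then applies a learned scale and shift: x_hat = (x - mu_B) / sqrt(var_B + eps) and y = gamma * x_hat + beta, where mu_B and var_B are the mini-batch mean and variance. Below is a minimal pure-Python sketch of the training-time forward pass. It omits the running averages used at inference and the backward pass, and the function name and list-of-rows layout are assumptions for illustration.

```python
import math
from typing import List, Sequence


def batch_norm(batch: Sequence[Sequence[float]], gamma: Sequence[float],
               beta: Sequence[float], eps: float = 1e-5) -> List[List[float]]:
    """Training-mode batch normalization of a (batch, features) list of rows."""
    m = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(m)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / m
        var = sum((v - mean) ** 2 for v in column) / m  # biased (1/m) mini-batch variance
        for i, v in enumerate(column):
            x_hat = (v - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]  # learned scale and shift
    return out


batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in out:
    print([round(v, 3) for v in row])
# both columns map to roughly [-1.342, -0.447, 0.447, 1.342]
```

The second feature is ten times the first, yet both normalize to nearly the same values. This scale invariance is part of what lets the network tolerate larger learning rates.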