• Corpus ID: 5707386

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

  title={TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems},
  author={Mart{\'i}n Abadi and Ashish Agarwal and Paul Barham and Eugene Brevdo and Z. Chen and Craig Citro and Gregory S. Corrado and Andy Davis and Jeffrey Dean and Matthieu Devin and Sanjay Ghemawat and Ian J. Goodfellow and Andrew Harp and Geoffrey Irving and Michael Isard and Yangqing Jia and Rafal J{\'o}zefowicz and Lukasz Kaiser and Manjunath Kudlur and Josh Levenberg and Dandelion Man{\'e} and Rajat Monga and Sherry Moore and Derek Gordon Murray and Christopher Olah and Mike Schuster and Jonathon Shlens and Benoit Steiner and Ilya Sutskever and Kunal Talwar and Paul A. Tucker and Vincent Vanhoucke and Vijay Vasudevan and Fernanda B. Vi{\'e}gas and Oriol Vinyals and Pete Warden and Martin Wattenberg and Martin Wicke and Yuan Yu and Xiaoqiang Zheng},
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of… 

TensorFlow: A system for large-scale machine learning

The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.

Operator Vectorization Library – A TensorFlow Plugin

The OVL optimizer, which provides automated kernel fusion of OVL operators at runtime, is described and shown to give a 2.3x speed-up over a pure TensorFlow implementation.

TensorX: Extensible API for Neural Network Model Design and Deployment

TensorX is a Python library for prototyping, design, and deployment of complex neural network models in TensorFlow, aiming to make available high-level components like neural network layers that are, in effect, stateful functions, easy to compose and reuse.

SingleCaffe: An Efficient Framework for Deep Learning on a Single Node

SingleCaffe is presented, a DL framework that makes full use of hardware with high computing power to improve the computational efficiency of the training process; experimental results show that SingleCaffe improves training efficiency.


This work introduces a novel framework for high-level programming that addresses all of the above shortcomings of toolkits and allows users to make use of heterogeneous and emerging hardware architectures.

Scalability Study of Deep Learning Algorithms in High Performance Computer Infrastructures

This project shows how the training of a state-of-the-art neural network for computer vision can be parallelized with the TensorFlow framework on a distributed GPU cluster, the Minotauro GPU cluster at the Barcelona Supercomputing Center.

Increasing Portable Machine Learning Performance by Application of Rewrite Rules on Google Tensorflow Data Flow Graphs

An attempt is made to convert Google Tensorflow segments into a functional representation of the Lift programming language, which allows multiple operations to be combined in order to reduce unnecessary overhead required for calling multiple kernels.

In-Database Machine Learning: Gradient Descent and Tensor Algebra for Main Memory Database Systems

This work aims to incorporate gradient descent and tensor data types into database systems, allowing them to handle a wider range of computational tasks, and implements tensor algebra and stochastic gradient descent using lambda expressions for loss functions as a pipelined operator in a main memory database system.

TensorBow: Supporting Small-Batch Training in TensorFlow

The main challenges in implementing TensorBow stem from the fact that many TensorFlow components and abstractions are designed under the assumption of training a single model replica per GPU, making them unsafe for concurrent use; this work extends those components to safely train multiple model replicas per GPU.

A Comparison of Distributed Machine Learning Platforms

This work studies Spark as a representative dataflow system, PMLS as a parameter-server system, and TensorFlow and MXNet as examples of more advanced dataflow systems, and analyzes the communication and control bottlenecks for these approaches.



Project Adam: Building an Efficient and Scalable Deep Learning Training System

The design and implementation of Adam, a distributed system comprised of commodity server machines for training large deep neural network models, is described; the system exhibits world-class performance, scaling, and task accuracy on visual recognition tasks, and the results show that task accuracy improves with larger models.

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

An introduction to computational networks and the computational network toolkit (invited talk)

The computational network toolkit (CNTK), an implementation of CN that supports both GPU and CPU, is introduced, and its architecture and key components, the command line options for using CNTK, and the network definition and model editing language are described.

Building high-level features using large scale unsupervised learning

  • Quoc V. Le, M. Ranzato, A. Ng
  • Computer Science
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.

On rectified linear units for speech processing

This work shows that substituting rectified linear units for logistic units can improve generalization and make training of deep networks faster and simpler.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
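
The source-reversal idea in the summary above can be illustrated with a minimal Python sketch. The function names (`reverse_source`, `lag`) and the toy sentence pair are hypothetical, chosen only for illustration; this is not the paper's code. The sketch assumes source and target are aligned token by token and measures how far apart each aligned pair sits when the source is fed before the target.

```python
def reverse_source(pairs):
    """Reverse the token order of every source sentence; targets are unchanged."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]


def lag(src, tgt, token, i):
    """Distance between `token` in the source and the i-th target token
    when the source is read first and the target immediately after it."""
    concatenated = src + tgt
    return (len(src) + i) - concatenated.index(token)


# Toy aligned pair (an assumption for illustration): a->x, b->y, c->z.
pairs = [(["a", "b", "c"], ["x", "y", "z"])]
reversed_pairs = reverse_source(pairs)

src, tgt = pairs[0]
rsrc, _ = reversed_pairs[0]

# Lag between each source token and its aligned target token.
lags_original = [lag(src, tgt, tok, i) for i, tok in enumerate(src)]
lags_reversed = [lag(rsrc, tgt, tok, i) for i, tok in enumerate(src)]
```

In this toy case every aligned pair is three steps apart in the original order, while after reversal the first pair is adjacent and later pairs are farther apart. The short lag on the first pairs corresponds to the short-term dependencies the summary mentions.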

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

Pointer Networks

A new neural architecture, called Ptr-Net, is introduced to learn the conditional probability of an output sequence whose elements are discrete tokens corresponding to positions in an input sequence, using a recently proposed mechanism of neural attention; it not only improves over sequence-to-sequence with input attention, but also generalizes to variable-size output dictionaries.

Dandelion: a compiler and runtime for heterogeneous systems

Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution.