• Corpus ID: 6287870

# TensorFlow: A system for large-scale machine learning

@article{Abadi2016TensorFlowAS,
title={TensorFlow: A system for large-scale machine learning},
author={Mart{\'i}n Abadi and Paul Barham and Jianmin Chen and Z. Chen and Andy Davis and Jeffrey Dean and Matthieu Devin and Sanjay Ghemawat and Geoffrey Irving and Michael Isard and Manjunath Kudlur and Josh Levenberg and Rajat Monga and Sherry Moore and Derek Gordon Murray and Benoit Steiner and Paul A. Tucker and Vijay Vasudevan and Pete Warden and Martin Wicke and Yuan Yu and Xiaoqiang Zheng},
journal={ArXiv},
year={2016},
volume={abs/1605.08695}
}
• Published 27 May 2016
• Computer Science
• ArXiv
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives…
12,667 Citations
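The dataflow model described in the abstract can be illustrated with a minimal sketch (plain Python, not TensorFlow's actual API): operations are nodes, edges carry values between them, and evaluation walks the graph in dependency order, caching each node's result so shared subgraphs are computed once.

```python
# Minimal illustration of the dataflow-graph model (hypothetical names,
# not TensorFlow's API): nodes are operations, edges carry values, and
# execution resolves dependencies recursively with memoization.

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # callable applied to the input values
        self.inputs = inputs  # upstream Node objects (the incoming edges)

def run(node, cache=None):
    """Evaluate a node by first evaluating its inputs, caching results."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    args = [run(n, cache) for n in node.inputs]
    cache[node] = node.op(*args)
    return cache[node]

# Build a tiny graph computing (a + b) * b with constants a=2, b=3.
a = Node(lambda: 2)
b = Node(lambda: 3)
s = Node(lambda x, y: x + y, a, b)
out = Node(lambda x, y: x * y, s, b)

print(run(out))  # prints 15
```

In the real system each node can additionally be placed on a specific device (CPU core, GPU, or TPU) and the edges between devices become network or PCIe transfers; the memoized-evaluation structure above is the part this sketch captures.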

## Citations of this paper

Performance Analysis of Just-in-Time Compilation for Training TensorFlow Multi-Layer Perceptrons
• Computer Science
• 2018
The performance of Just-In-Time (JIT) compilation in TensorFlow is investigated for the relatively straightforward use-case of training Multi-Layer Perceptrons (MLPs). The analysis aims to develop an understanding of when JIT compilation is beneficial for performance, which could then be used to enable or disable JIT compilation in future program executions.
Improving the Performance of Distributed TensorFlow with RDMA
• Computer Science
International Journal of Parallel Programming
• 2017
This work presents an RDMA-capable design of TensorFlow that scales well with training-cluster size and achieves nearly 6× performance improvement over the original distributed TensorFlow, which is based on gRPC.
TensorBow: Supporting Small-Batch Training in TensorFlow
The main challenges in implementing TensorBow stem from the fact that many TensorFlow components and abstractions are designed under the assumption of training a single model replica per GPU, making them unsafe for concurrent use; those components were extended to safely train multiple model replicas per GPU.
A Performance Evaluation of Distributed TensorFlow
• Computer Science
• 2017
From the experimental results, it is confirmed that TensorFlow can accelerate execution in all the environments tested, and it is found that the mini-batch size has a large influence in a distributed environment over a 1 Gbps network.
Benchmarking TensorFlow on a personal computer not specialised for machine learning
• Computer Science
• 2018
This study benchmarks and investigates the performance of TensorFlow in terms of images per second on a personal computer not specialised for machine learning, and concludes that improving the GPU, rather than the CPU, has greater potential for improving performance.
Towards Evaluation of Tensorflow Performance in a Distributed Compute Environment
• Computer Science
TPCTC
• 2018
The results show that with the right choice of input parameters and appropriate hardware, GPU-equipped general-purpose compute clusters can provide comparable deep learning training performance to specialized machines designed for AI workloads.
Tensor Relational Algebra for Distributed Machine Learning System Design
• Computer Science
Proc. VLDB Endow.
• 2021
The TRA is a set-based algebra based on the relational algebra that is easily executed with high efficiency in a parallel or distributed environment, and amenable to automatic optimization.
EasyDist: An End-to-End Distributed Deep Learning Tool for Cloud
• Computer Science
• 2019
EasyDist is an end-to-end DDL tool that preserves the single-node programming model by leveraging distributed TensorFlow between a Keras interface and public cloud infrastructure. Evaluation of EasyDist on publicly available benchmark datasets and models shows that model accuracy is not compromised and that training times can be reduced by up to ~6-8x compared to single-machine settings.
Fast Distributed Deep Learning over RDMA
• Computer Science
EuroSys
• 2019
It is shown that RPC is suboptimal for distributed deep learning computation, especially on an RDMA-capable network; the proposed graph analyzer examines both the dataflow graph and the tensors to optimize memory allocation and remote data access through this interface.
TensorLayer: A Versatile Library for Efficient Deep Learning Development
• Computer Science
ACM Multimedia
• 2017
TensorLayer is a versatile Python-based deep learning library that provides high-level modules abstracting sophisticated operations over neural-network layers, network models, training data, and dependent training jobs, and has transparent module interfaces that allow developers to flexibly embed low-level controls within a backend engine.

## References

SHOWING 1-10 OF 111 REFERENCES
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
• Computer Science
ArXiv
• 2016
The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
SparkNet: Training Deep Networks in Spark
• Computer Science
ICLR
• 2016
This work introduces SparkNet, a framework for training deep networks in Spark using a simple parallelization scheme for stochastic gradient descent that scales well with the cluster size and tolerates very high-latency communication.
Project Adam: Building an Efficient and Scalable Deep Learning Training System
• Computer Science
OSDI
• 2014
The design and implementation of a distributed system called Adam, composed of commodity server machines, to train large deep neural network models; Adam exhibits world-class performance, scaling, and task accuracy on visual recognition tasks and shows that task accuracy improves with larger models.
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
• Computer Science
ArXiv
• 2015
The API design and the system implementation of MXNet are described, and it is explained how embedding of both symbolic expression and tensor operation is handled in a unified fashion.
Large Scale Distributed Deep Networks
• Computer Science
NIPS
• 2012
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
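The data-parallel, parameter-server pattern behind Downpour SGD can be sketched briefly. The following is an illustrative simplification under stated assumptions (all names are hypothetical, the "asynchrony" of workers is simulated by picking a random worker per step, and the model is a one-parameter linear regression), not the paper's implementation:

```python
# Hedged sketch of data-parallel SGD with a shared parameter store, in
# the spirit of Downpour SGD: workers hold data shards, compute gradients
# independently, and apply possibly-stale updates to shared parameters.
import random

def grad(w, batch):
    # Gradient of mean squared error for a 1-D linear model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def downpour_sgd(data, n_workers=4, steps=200, lr=0.05):
    w = 0.0                                    # shared parameter ("server")
    shards = [data[i::n_workers] for i in range(n_workers)]  # data shards
    rng = random.Random(0)
    for _ in range(steps):
        worker = rng.randrange(n_workers)      # some worker finishes a step
        batch = rng.sample(shards[worker], k=min(4, len(shards[worker])))
        w -= lr * grad(w, batch)               # push its update to the server
    return w

# Synthetic data from y = 3x; the learned weight should approach 3.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]
w = downpour_sgd(data)
print(round(w, 2))
```

The real systems (DistBelief here, and later TensorFlow) additionally partition the model itself across machines and let workers run genuinely concurrently, which this sequential simulation deliberately omits.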
Caffe: Convolutional Architecture for Fast Feature Embedding
• Computer Science
ACM Multimedia
• 2014
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Building high-level features using large scale unsupervised learning
• Computer Science
2013 IEEE International Conference on Acoustics, Speech and Signal Processing
• 2013
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.
Theano: A Python framework for fast computation of mathematical expressions
• Computer Science
ArXiv
• 2016
The performance of Theano is compared against Torch7 and TensorFlow on several machine learning models and recently-introduced functionalities and improvements are discussed.
GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server
• Computer Science
EuroSys
• 2016
GeePS enables a state-of-the-art single-node GPU implementation to scale well, e.g. to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
Rethinking the Inception Architecture for Computer Vision
• Computer Science
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
• 2016
This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.