Corpus ID: 6287870

TensorFlow: A system for large-scale machine learning

@article{Abadi2016TensorFlowAS,
  title={TensorFlow: A system for large-scale machine learning},
  author={Mart{\'i}n Abadi and Paul Barham and Jianmin Chen and Zhifeng Chen and Andy Davis and Jeffrey Dean and Matthieu Devin and Sanjay Ghemawat and Geoffrey Irving and Michael Isard and Manjunath Kudlur and Josh Levenberg and Rajat Monga and Sherry Moore and Derek Gordon Murray and Benoit Steiner and Paul A. Tucker and Vijay Vasudevan and Pete Warden and Martin Wicke and Yuan Yu and Xiaoqiang Zheng},
  journal={ArXiv},
  year={2016},
  volume={abs/1605.08695}
}
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives… 
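As a concrete illustration of the dataflow-and-device-placement model the abstract describes, the sketch below builds a tiny computation in present-day TensorFlow and pins its pieces to devices; the device names, shapes, and the fallback logic are assumptions for illustration, not anything taken from the paper.

```python
import tensorflow as tf

# Pick an accelerator if one is visible; "/CPU:0" always exists, "/GPU:0" is
# an assumption about the local machine, so we fall back to the CPU.
gpu = "/GPU:0" if tf.config.list_physical_devices("GPU") else "/CPU:0"

with tf.device("/CPU:0"):
    x = tf.random.normal([1024, 1024])             # input produced on the CPU
    w = tf.Variable(tf.random.normal([1024, 10]))  # shared, mutable state

with tf.device(gpu):
    y = tf.matmul(x, w)                            # heavy op on the accelerator

print(y.device, y.shape)
```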

Citations

Performance Analysis of Just-in-Time Compilation for Training TensorFlow Multi-Layer Perceptrons
TLDR
The performance of Just-In-Time (JIT) compilation in TensorFlow is investigated for the relatively straightforward use case of training Multi-Layer Perceptrons (MLPs). The performance analysis aims to develop an understanding of when JIT compilation is beneficial for performance, which could then be used to enable or disable JIT compilation in future program executions.
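A minimal sketch of the kind of experiment this summary describes, assuming current TensorFlow where per-function XLA JIT is requested with tf.function(jit_compile=True); the MLP sizes and synthetic data are arbitrary, and the actual timing and analysis are left out.

```python
import tensorflow as tf

# Toy MLP and synthetic data; sizes are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.keras.optimizers.SGD(0.01)
x = tf.random.normal([512, 784])
y = tf.random.uniform([512], maxval=10, dtype=tf.int32)

def make_step(jit):
    # jit_compile=True asks TensorFlow to compile the step with XLA.
    @tf.function(jit_compile=jit)
    def step(x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x))
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    return step

for jit in (False, True):
    step = make_step(jit)
    step(x, y)  # first call traces (and compiles when jit=True); exclude from timing
    print("jit =", jit, "loss =", float(step(x, y)))
```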
Improving the Performance of Distributed TensorFlow with RDMA
TLDR
This work presents an RDMA-capable design of TensorFlow which shows great scalability as the training scale grows and achieves nearly 6× performance improvement over the original gRPC-based distributed TensorFlow.
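For context on the gRPC baseline being compared against: in the TF 1.x-era distributed runtime, the transport was selected when the server was created. The sketch below uses the tf.compat.v1 API with placeholder host names; the RDMA "grpc+verbs" protocol string was only accepted by builds compiled with verbs support, which is an assumption about the build rather than something shown here.

```python
import tensorflow.compat.v1 as tf  # TF 1.x-style distributed runtime API

# Placeholder cluster definition; real deployments list their own hosts.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Default transport is gRPC; builds with verbs (RDMA) support could instead
# request protocol="grpc+verbs" when creating the server.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc")
server.join()
```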
TensorBow: Supporting Small-Batch Training in TensorFlow
TLDR
The main challenges in implementing TensorBow stem from the fact that many TensorFlow components and abstractions are designed under the assumption of training a single model replica per GPU, making them unsafe for concurrent use; those components were extended to safely train multiple model replicas per GPU.
A Performance Evaluation of Distributed TensorFlow
TLDR
The experimental results confirm that TensorFlow accelerates execution in all of the environments tested, and show that the mini-batch size has a large influence in a distributed environment over a 1 Gbps network.
Benchmarking TensorFlow on a personal computer not specialised for machine learning
TLDR
This study benchmarks and investigates the performance of TensorFlow, measured in images per second, on a personal computer not specialised for machine learning, and concludes that improving the GPU, rather than the CPU, has greater potential for improving performance.
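A rough sketch of how such an images-per-second figure can be measured, assuming a synthetic batch and a stand-in Keras network (MobileNetV2, chosen arbitrarily here); fetching the loss value at the end forces any queued device work to finish before the clock stops.

```python
import time
import tensorflow as tf

batch_size, steps = 64, 50
images = tf.random.normal([batch_size, 224, 224, 3])   # synthetic batch
labels = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int32)

model = tf.keras.applications.MobileNetV2(weights=None)  # stand-in network
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
opt = tf.keras.optimizers.SGD(0.01)

@tf.function
def step():
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss

step()  # warm-up: graph tracing and kernel initialization
start = time.perf_counter()
for _ in range(steps):
    loss = step()
loss.numpy()  # block until the queued work has finished
elapsed = time.perf_counter() - start
print(f"{batch_size * steps / elapsed:.1f} images/sec")
```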
Towards Evaluation of Tensorflow Performance in a Distributed Compute Environment
TLDR
The results show that with the right choice of input parameters and appropriate hardware, GPU-equipped general-purpose compute clusters can provide comparable deep learning training performance to specialized machines designed for AI workloads.
Tensor Relational Algebra for Distributed Machine Learning System Design
TLDR
The TRA is a set-based algebra, grounded in relational algebra, that can be executed with high efficiency in a parallel or distributed environment and is amenable to automatic optimization.
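The following is an illustrative NumPy sketch of the core idea, not the paper's implementation: a tensor is stored as a relation of (block-key, dense-tile) pairs, and a blocked matrix multiply becomes a join on the shared block index followed by an aggregation, which is exactly the kind of plan a relational optimizer can rearrange.

```python
import numpy as np
from collections import defaultdict

# Represent a matrix as a relation mapping (row_block, col_block) -> dense tile.
def to_relation(m, bs):
    rel = {}
    for i in range(0, m.shape[0], bs):
        for j in range(0, m.shape[1], bs):
            rel[(i // bs, j // bs)] = m[i:i+bs, j:j+bs]
    return rel

def matmul_relational(A_rel, B_rel):
    # Join on the shared block index k, multiply the joined tiles, then
    # aggregate (sum) by output key (i, j): the relational view of blocked matmul.
    out = defaultdict(lambda: 0)
    for (i, k), a in A_rel.items():
        for (k2, j), b in B_rel.items():
            if k == k2:
                out[(i, j)] = out[(i, j)] + a @ b
    return dict(out)

rng = np.random.default_rng(0)
A, B, bs = rng.normal(size=(4, 4)), rng.normal(size=(4, 4)), 2
C_rel = matmul_relational(to_relation(A, bs), to_relation(B, bs))
C = np.block([[C_rel[(0, 0)], C_rel[(0, 1)]], [C_rel[(1, 0)], C_rel[(1, 1)]]])
print(np.allclose(C, A @ B))  # True
```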
EasyDist: An End-to-End Distributed Deep Learning Tool for Cloud
TLDR
EasyDist is an end-to-end DDL tool that preserves the single-node programming model by leveraging distributed TensorFlow between a Keras interface and public cloud infrastructure; evaluation of EasyDist on publicly available benchmark datasets and models shows that model accuracy is not compromised and that training times can be reduced by up to ~6-8x compared to single-machine settings.
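EasyDist's own interface is not shown here, but the underlying pattern it builds on can be sketched with stock TensorFlow: the ordinary single-node Keras code is left unchanged and only wrapped in a tf.distribute strategy scope, with cluster membership supplied through TF_CONFIG. The single-worker localhost entry below is a placeholder so the sketch runs standalone; a real deployment would list every worker and set the task index per machine.

```python
import json, os
import tensorflow as tf

# Cluster membership comes from the TF_CONFIG environment variable; a tool in
# the spirit of EasyDist would populate this per worker. Placeholder values:
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["localhost:12345"]},
    "task": {"type": "worker", "index": 0},
}))

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# The model-building code is ordinary single-node Keras code; only the
# strategy scope around it changes.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

x = tf.random.normal([1024, 784])
y = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
model.fit(x, y, epochs=1, batch_size=64)
```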
Fast Distributed Deep Learning over RDMA
TLDR
It is shown that RPC is suboptimal for distributed deep learning computation, especially on an RDMA-capable network, and that a graph analyzer examining both the dataflow graph and the tensors can optimize memory allocation and remote data access over an RDMA-based interface.
TensorLayer: A Versatile Library for Efficient Deep Learning Development
TLDR
TensorLayer is a Python-based versatile deep learning library that provides high-level modules abstracting sophisticated operations over neuron layers, network models, training data and dependent training jobs, and has transparent module interfaces that allow developers to flexibly embed low-level controls within a backend engine.

References

SHOWING 1-10 OF 110 REFERENCES
SparkNet: Training Deep Networks in Spark
TLDR
This work introduces SparkNet, a framework for training deep networks in Spark using a simple parallelization scheme for stochastic gradient descent that scales well with the cluster size and tolerates very high-latency communication.
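A plain NumPy simulation of the parallelization scheme this summary refers to (not SparkNet's Spark code): every worker runs a fixed number of local SGD steps on its own partition, and the driver averages the resulting parameters once per round, so synchronization is infrequent and high-latency communication is tolerable. The linear-regression data, step counts, and learning rate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)

# Each "worker" holds a partition of a synthetic linear-regression dataset.
def make_partition(n=200):
    X = rng.normal(size=(n, 5))
    return X, X @ w_true + 0.1 * rng.normal(size=n)

partitions = [make_partition() for _ in range(4)]

def local_sgd(w, data, steps=50, lr=0.01, batch=16):
    X, y = data
    for _ in range(steps):
        idx = rng.integers(0, len(y), size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w = w - lr * grad
    return w

# SparkNet-style rounds: broadcast w, run fixed-length local SGD on every
# worker, then average; only one synchronization per round.
w = np.zeros(5)
for _ in range(10):
    w = np.mean([local_sgd(w.copy(), part) for part in partitions], axis=0)

print("parameter error:", np.linalg.norm(w - w_true))
```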
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
TLDR
The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Project Adam: Building an Efficient and Scalable Deep Learning Training System
TLDR
The design and implementation of a distributed system called Adam, composed of commodity server machines, to train large deep neural network models; it exhibits world-class performance, scaling and task accuracy on visual recognition tasks and shows that task accuracy improves with larger models.
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
TLDR
The API design and the system implementation of MXNet are described, and it is explained how embedding of both symbolic expression and tensor operation is handled in a unified fashion.
Large Scale Distributed Deep Networks
TLDR
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
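A much-simplified, single-process sketch of the Downpour SGD pattern described here, with threads standing in for workers and a shared array standing in for the parameter server: each worker repeatedly fetches the current parameters, computes a gradient on its own data shard, and pushes an update without waiting for the others, so updates are applied to slightly stale parameters. The toy regression problem is invented for illustration.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)

# Shared "parameter server": one parameter vector plus a lock for the update.
params = np.zeros(5)
lock = threading.Lock()

def worker(seed, steps=200, lr=0.01, batch=16):
    local_rng = np.random.default_rng(seed)
    X = local_rng.normal(size=(500, 5))            # this worker's data shard
    y = X @ w_true + 0.1 * local_rng.normal(size=500)
    for _ in range(steps):
        w = params.copy()                          # asynchronously fetch params
        idx = local_rng.integers(0, 500, size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        with lock:
            # Push the update; there is no barrier with other workers, so the
            # gradient may have been computed against stale parameters.
            params[:] -= lr * grad

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("parameter error:", np.linalg.norm(params - w_true))
```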
Caffe: Convolutional Architecture for Fast Feature Embedding
TLDR
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Building high-level features using large scale unsupervised learning
TLDR
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.
Theano: A Python framework for fast computation of mathematical expressions
TLDR
The performance of Theano is compared against Torch7 and TensorFlow on several machine learning models and recently-introduced functionalities and improvements are discussed.
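For readers unfamiliar with Theano's programming model (its development has since ended, with Aesara/PyTensor as community continuations), the comparison above concerns compiling symbolic expression graphs into callable functions, roughly as in this classic-API sketch; the shapes and the expression itself are arbitrary.

```python
import numpy as np
import theano
import theano.tensor as T

# Declare symbolic variables and build an expression graph.
x = T.dmatrix("x")
w = theano.shared(np.random.randn(3, 2), name="w")
y = T.nnet.sigmoid(T.dot(x, w)).sum()

# Compile the graph into a callable, with the gradient derived symbolically.
f = theano.function([x], [y, T.grad(y, w)])

value, grad = f(np.random.randn(4, 3))
print(value, grad.shape)  # scalar output and a (3, 2) gradient
```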
GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server
TLDR
GeePS enables a state-of-the-art single-node GPU implementation to scale well, for example to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.
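The factorization idea can be sketched with Keras layers: two stacked 3x3 convolutions cover the same receptive field as a single 5x5 at lower cost, and an n x n convolution can be split into 1 x n followed by n x 1. The channel counts and input shape below are arbitrary, and the printed parameter counts make the saving concrete.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 64))

# Direct 5x5 convolution: 5*5*64 = 1600 weights per output channel.
direct = layers.Conv2D(64, 5, padding="same", activation="relu")(inputs)

# Factorized: two stacked 3x3 convolutions cover the same 5x5 receptive
# field with 2 * 3*3*64 = 1152 weights per output channel.
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
stacked = layers.Conv2D(64, 3, padding="same", activation="relu")(x)

# Asymmetric factorization: 7x7 replaced by 1x7 followed by 7x1.
x = layers.Conv2D(64, (1, 7), padding="same", activation="relu")(inputs)
asym = layers.Conv2D(64, (7, 1), padding="same", activation="relu")(x)

for name, t in [("direct 5x5", direct), ("two 3x3", stacked), ("1x7 + 7x1", asym)]:
    print(name, "params:", tf.keras.Model(inputs, t).count_params())
```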