• Corpus ID: 1507815

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

@article{Chen2015MXNetAF,
  title={MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems},
  author={Tianqi Chen and Mu Li and Yutian Li and Min Lin and Naiyan Wang and Minjie Wang and Tianjun Xiao and Bing Xu and Chiyuan Zhang and Zheng Zhang},
  journal={ArXiv},
  year={2015},
  volume={abs/1512.01274}
}
MXNet is a multi-language machine learning (ML) library built to ease the development of ML algorithms, especially deep neural networks. Our preliminary experiments reveal promising results on large-scale deep neural network applications using multiple GPU machines.
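
The abstract alludes to a declare-then-train workflow across several devices. Below is a minimal, hedged sketch of that idea using MXNet's classic Symbol/Module API (MXNet 1.x); the network shape, synthetic data, and hyperparameters are illustrative choices, not taken from the paper.

    import mxnet as mx
    import numpy as np

    # Declare a small multilayer perceptron as a symbolic graph.
    data = mx.sym.Variable('data')
    fc1  = mx.sym.FullyConnected(data, num_hidden=128, name='fc1')
    act1 = mx.sym.Activation(fc1, act_type='relu', name='relu1')
    fc2  = mx.sym.FullyConnected(act1, num_hidden=10, name='fc2')
    net  = mx.sym.SoftmaxOutput(fc2, name='softmax')

    # Synthetic data standing in for a real dataset.
    X = mx.nd.array(np.random.uniform(size=(1000, 784)))
    y = mx.nd.array(np.random.randint(0, 10, size=1000))
    train_iter = mx.io.NDArrayIter(X, y, batch_size=64)

    # Bind the symbol to several devices; the Module splits each batch
    # across them and a KVStore aggregates the gradients ('dist_sync' or
    # 'dist_async' would be used for multi-machine training).
    mod = mx.mod.Module(net, context=[mx.gpu(0), mx.gpu(1)])
    mod.fit(train_iter, num_epoch=1,
            optimizer='sgd', optimizer_params={'learning_rate': 0.1},
            kvstore='device')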

Citations

Parallax: Automatic Data-Parallel Training of Deep Neural Networks

Parallax is introduced, a tool for automatic parallelization of deep learning training in distributed environments that not only handles subtle correctness issues but also leverages various optimizations to minimize the communication overhead caused by scaling out.

Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

  • Davit Buniatyan
  • Computer Science
    2019 Computer Science and Information Technologies (CSIT)
  • 2019
Hyper is a hybrid distributed cloud framework that provides a unified view over multiple clouds and on-premise infrastructure for processing tasks on both CPU and GPU compute instances at scale, independent of the language and deep learning framework used.

A Comparison of Distributed Machine Learning Platforms

This work studies Spark as a representative dataflow system, PMLS as a parameter-server system, and TensorFlow and MXNet as examples of more advanced dataflow systems, and analyzes the communication and control bottlenecks for these approaches.

MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster

This paper develops and implements a novel job-server parallel software framework, named MiMatrix, for distributed deep learning training, and proposes a novel GPUDirect Remote Direct Memory Access (RDMA)-aware parallel AllReduce algorithm executed by the computing servers.

TensorFlow: A system for large-scale machine learning

The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.

Auto-Parallelizing Deep Learning for Multi-machine, Multi-GPU Environments

Parallax is introduced, an auto-parallelization module that helps machine learning researchers extend their single-model code to run data-parallel across multiple GPUs and machines, along with several extensions to Parallax, including the application of model-parallelism strategies to boost performance for models with relatively large parameters.

ParallelNAS: A Parallel and Distributed System for Neural Architecture Search

  • Xiaoyang Qu, Jianzong Wang, Jing Xiao
  • Computer Science
    2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2020
A two-level hierarchical parallel system is presented, comprising a parallel explorer and parallel evaluators built on virtualized, massively parallel, asynchronous infrastructure, which reaches near-linear speedups on a cluster of 64 GPUs.

A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training

The central node of MiMatrix, referred to as the job server, undertakes all controlling, scheduling, monitoring, and I/O tasks without transferring weight data for AllReduce processing in each iteration, thereby resolving the central-node bandwidth bottleneck of the parameter-server framework widely used in distributed DL tasks.

Enabling Fast and Flexible Distributed Deep Learning with Programmable Switches

Libra, a network aggregator that utilizes in-network computation to optimize communication for distributed DL training in two aspects, i) reducing active connections and ii) aggregating exchanged network packets, is designed and implemented.

Triton: an intermediate language and compiler for tiled neural network computations

Triton is presented, a language and compiler centered around the concept of tile, i.e., statically shaped multi-dimensional sub-arrays for expressing tensor programs in terms of operations on parametric tile variables and a set of novel tile-level optimization passes for compiling these programs into efficient GPU code.
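
To make the tile idea concrete, here is a small vector-add kernel written against Triton's present-day Python front end (the cited paper itself introduced a C-like Triton-C language); the block size and tensor sizes are arbitrary illustration choices.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)               # one program instance per tile
        offs = pid * BLOCK + tl.arange(0, BLOCK)  # statically shaped tile of indices
        mask = offs < n_elements                  # guard the ragged last tile
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x + y, mask=mask)

    x = torch.rand(4096, device='cuda')
    y = torch.rand(4096, device='cuda')
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
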
...

References

Showing 1-10 of 13 references

Communication Efficient Distributed Machine Learning with the Parameter Server

An in-depth analysis of two large-scale machine learning problems, ranging from l1-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636 TB of real data with hundreds of billions of samples and dimensions, is presented.

Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning

Minerva proposes a matrix-based API, resulting in compact code and a MATLAB-like, imperative, procedural coding style, and provides language flexibility and execution efficiency simultaneously within one coherent framework.

Torch7: A Matlab-like Environment for Machine Learning

Torch7 is a versatile numeric computing framework and machine learning library that extends Lua and can easily be interfaced to third-party software thanks to Lua's light interface.

Purine: A bi-graph based deep learning framework

It is demonstrated that different parallelism schemes over GPUs and/or CPUs, on single or multiple machines, can be universally implemented by graph composition, which frees researchers from coding for various parallelization schemes, and that the same dispatcher can be used for solving the variant graphs.

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

Scaling Distributed Machine Learning with the Parameter Server

Views on newly identified challenges are shared, and application scenarios such as micro-blog data analysis and data processing for building next-generation search engines are covered.
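
The core of the parameter-server interface is a key-value push/pull protocol, which MXNet exposes through its KVStore. The sketch below uses a 'local' store and an illustrative key and shape; a real cluster would create a 'dist_sync' or 'dist_async' store with launched server processes.

    import mxnet as mx

    kv = mx.kv.create('local')            # 'dist_sync' / 'dist_async' on a cluster
    shape = (2, 3)

    kv.init('w', mx.nd.ones(shape))       # the store initializes key 'w'
    a = mx.nd.zeros(shape)
    kv.pull('w', out=a)                   # workers pull the current value

    kv.push('w', mx.nd.ones(shape) * 8)   # a worker pushes an update for 'w';
                                          # a list of values pushed for one key
                                          # (e.g. per-device gradients) is
                                          # aggregated by the store
    kv.pull('w', out=a)                   # every worker sees the new value
    print(a.asnumpy())                    # all entries are 8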

Theano: new features and speed improvements

New features and efficiency improvements to Theano are presented, along with benchmarks demonstrating Theano's performance relative to Torch7, a recently introduced machine learning library, and to RNNLM, a C++ library targeted at recurrent neural networks.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
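
For reference, the transform introduced by the cited paper normalizes each activation with mini-batch statistics and then applies a learned scale and shift:

    \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
    \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2, \qquad
    \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
    y_i = \gamma\,\hat{x}_i + \beta

where the scale gamma and shift beta are learned alongside the network weights.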

ImageNet Large Scale Visual Recognition Challenge

The creation of this benchmark dataset and the advances in object recognition that have been possible as a result are described, and state-of-the-art computer vision accuracy is compared with human accuracy.