
# MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

@article{Chen2015MXNetAF,
title={MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems},
author={Tianqi Chen and Mu Li and Yutian Li and Min Lin and Naiyan Wang and Minjie Wang and Tianjun Xiao and Bing Xu and Chiyuan Zhang and Zheng Zhang},
journal={ArXiv},
year={2015},
volume={abs/1512.01274}
}
• Published 3 December 2015
MXNet is a multi-language machine learning (ML) library designed to ease the development of ML algorithms, especially deep neural networks. Our preliminary experiments reveal promising results on large-scale deep neural network applications using multiple GPU machines.
1,994 Citations
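The multi-GPU data parallelism the abstract alludes to can be sketched in plain Python. This is an illustrative sketch only, not MXNet's API: the `gradient` and `data_parallel_step` names are mine, and Python lists stand in for device memory.

```python
# Illustrative sketch of synchronous data-parallel SGD: each "device"
# computes a gradient on its shard of the batch; the gradients are
# averaged (the all-reduce step) and the shared weights are updated.

def gradient(w, xs, ys):
    """Gradient of mean squared error for the linear model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_step(w, shards, lr=0.05):
    """One synchronous update across all devices."""
    grads = [gradient(w, xs, ys) for xs, ys in shards]  # one per device
    avg = sum(grads) / len(grads)                       # all-reduce step
    return w - lr * avg

# Two "devices", each holding half of the batch for the target y = 3x.
shards = [([1.0, 2.0], [3.0, 6.0]), ([3.0, 4.0], [9.0, 12.0])]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
# w converges toward 3.0
```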

## Citations

### Parallax: Automatic Data-Parallel Training of Deep Neural Networks

• Computer Science
ArXiv
• 2018
Parallax is introduced, a tool for automatic parallelization of deep learning training in distributed environments that not only handles subtle correctness issues but also leverages various optimizations to minimize the communication overhead caused by scaling out.

### Hyper: Distributed Cloud Processing for Large-Scale Deep Learning Tasks

• Davit Buniatyan
• Computer Science
2019 Computer Science and Information Technologies (CSIT)
• 2019
A hybrid distributed cloud framework is presented, offering a unified view over multiple clouds and an on-premise infrastructure for processing tasks on both CPU and GPU compute instances at scale, independent of the language and deep learning framework used.

### A Comparison of Distributed Machine Learning Platforms

• Computer Science
2017 26th International Conference on Computer Communication and Networks (ICCCN)
• 2017
This work studies Spark as a representative dataflow system, PMLS as a parameter-server system, and TensorFlow and MXNet as examples of more advanced dataflow systems, and analyzes the communication and control bottlenecks for these approaches.

### MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster

• Computer Science
ArXiv
• 2018
This paper develops and implements a novel job-server parallel software framework, named "MiMatrix", for distributed deep learning training, and proposes a novel GPUDirect Remote Direct Memory Access (RDMA)-aware parallel AllReduce algorithm executed by the computing servers.

### TensorFlow: A system for large-scale machine learning

• Computer Science
OSDI
• 2016
The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.

### Auto-Parallelizing Deep Learning for Multi-machine, Multi-GPU Environments

• Computer Science
• 2017
Parallax is introduced, an auto-parallelization module that helps machine learning researchers extend their single-model code to data-parallel execution across multiple GPUs and machines, along with several extensions to Parallax, including model parallelism strategies that boost performance for models with relatively large parameters.

### ParallelNAS: A Parallel and Distributed System for Neural Architecture Search

• Computer Science
2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
• 2020
A two-level hierarchical parallel system is presented, comprising a parallel explorer and parallel evaluators built on virtualized, massively parallel, asynchronous infrastructure, that reaches near-linear speedups on a cluster of 64 GPUs.

### A Novel Co-design Peta-scale Heterogeneous Cluster for Deep Learning Training

• Computer Science
• 2018
The central node of MiMatrix, referred to as the job server, undertakes all controlling, scheduling, monitoring, and I/O tasks without weight data transfer during AllReduce processing in each iteration, thereby solving the bandwidth bottleneck of the central node in the parameter-server framework widely used in distributed DL tasks.
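The AllReduce pattern that both MiMatrix papers build on can be illustrated with a ring all-reduce simulated over plain Python lists. This is a sketch of the communication pattern only (real implementations move the chunks over GPUDirect RDMA, as the paper describes); the `ring_allreduce` function and its layout are my own illustration.

```python
# Ring all-reduce: each of N workers holds a vector split into N
# chunks; chunks travel around the ring instead of through a central
# node, avoiding the parameter-server bandwidth bottleneck. After the
# exchange, every worker holds the element-wise sum.

def ring_allreduce(vectors):
    n = len(vectors)                   # workers == chunks per vector
    data = [list(v) for v in vectors]  # data[i][j]: worker i, chunk j
    # Reduce-scatter: after n-1 steps, worker i holds the fully
    # reduced chunk (i + 1) % n. Sends are snapshotted to mimic
    # simultaneous communication.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n])
                 for i in range(n)]
        for i, j, val in sends:
            data[(i + 1) % n][j] += val
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n])
                 for i in range(n)]
        for i, j, val in sends:
            data[(i + 1) % n][j] = val
    return data

result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every worker now holds [12, 15, 18]
```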

### Enabling Fast and Flexible Distributed Deep Learning with Programmable Switches

• Computer Science
• 2022
Libra, a network aggregator, is designed and implemented; it uses in-network computation to optimize communication for distributed DL training in two aspects: (i) reducing active connections and (ii) aggregating exchanged network packets.

### Triton: an intermediate language and compiler for tiled neural network computations

• Computer Science
MAPL@PLDI
• 2019
Triton is presented, a language and compiler centered around the concept of a tile, i.e., a statically shaped multi-dimensional sub-array, for expressing tensor programs as operations on parametric tile variables, together with a set of novel tile-level optimization passes for compiling these programs into efficient GPU code.
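The tile abstraction Triton is built around can be illustrated with a blocked matrix multiply in plain Python. This sketches only the idea of organizing the computation around statically sized tiles; Triton itself compiles such tile programs to GPU code, and the names below are mine.

```python
# Blocked (tiled) matrix multiply: the loop nest is organized around
# fixed-size tiles, so each tile of the output C is computed from a
# strip of A-tiles and B-tiles -- the unit of work a tile language
# like Triton exposes to the compiler.

TILE = 2  # statically known tile size, as in Triton's model

def matmul_tiled(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):            # tile row of C
        for j0 in range(0, m, TILE):        # tile column of C
            for k0 in range(0, k, TILE):    # reduction over tiles
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        for kk in range(k0, min(k0 + TILE, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C

A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
C = matmul_tiled(A, B)
# C == [[27, 30, 33], [61, 68, 75], [95, 106, 117]]
```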

## References

Showing 10 of 13 references.

### Communication Efficient Distributed Machine Learning with the Parameter Server

• Computer Science
NIPS
• 2014
An in-depth analysis of two large-scale machine learning problems, ranging from l1-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636 TB of real data with hundreds of billions of samples and dimensions, is presented.
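The parameter-server pattern this reference (and MXNet's KVStore) is built on can be sketched in plain Python: workers pull the current weights, compute gradients on their own data, and push the gradients back for the server to apply. Everything here is an illustrative, single-process, synchronous stand-in; real parameter servers shard keys across machines and communicate asynchronously, and all names are mine.

```python
# Minimal sketch of the parameter-server push/pull pattern.

class ParameterServer:
    def __init__(self, weights, lr=0.02):
        self.weights = dict(weights)
        self.lr = lr

    def pull(self):
        return dict(self.weights)        # workers get a snapshot

    def push(self, grads):
        for key, g in grads.items():     # apply a worker's gradients
            self.weights[key] -= self.lr * g

def worker_step(server, xs, ys):
    """Each worker fits y = w * x on its own shard (squared error)."""
    w = server.pull()["w"]
    g = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    server.push({"w": g})

server = ParameterServer({"w": 0.0})
for _ in range(100):
    worker_step(server, [1.0, 2.0], [2.0, 4.0])   # shard of y = 2x
    worker_step(server, [3.0, 4.0], [6.0, 8.0])   # another shard
# server.weights["w"] converges toward 2.0
```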

### Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning

• Computer Science
• 2014
Minerva proposes a matrix-based API, resulting in compact code and a Matlab-like, imperative and procedural coding style, and provides language flexibility and execution efficiency simultaneously within one coherent framework.

### Torch7: A Matlab-like Environment for Machine Learning

• Computer Science
NIPS 2011
• 2011
Torch7 is a versatile numeric computing framework and machine learning library that extends Lua and can easily be interfaced to third-party software thanks to Lua’s light interface.

### Purine: A bi-graph based deep learning framework

• Computer Science
ICLR
• 2015
It is demonstrated that different parallelism schemes over GPUs and/or CPUs, on single or multiple PCs, can be universally implemented by graph composition, which frees researchers from coding for various parallelization schemes, and the same dispatcher can be used for solving variant graphs.

### Large Scale Distributed Deep Networks

• Computer Science
NIPS
• 2012
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

### Caffe: Convolutional Architecture for Fast Feature Embedding

• Computer Science
ACM Multimedia
• 2014
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.

### Scaling Distributed Machine Learning with the Parameter Server

• Computer Science
BigDataScience '14
• 2014
Views on the new challenges identified are shared, and application scenarios such as micro-blog data analysis and data processing for building next-generation search engines are covered.

### Theano: new features and speed improvements

• Computer Science
ArXiv
• 2012
New features and efficiency improvements to Theano are presented, along with benchmarks demonstrating Theano's performance relative to Torch7, a recently introduced machine learning library, and to RNNLM, a C++ library targeted at recurrent neural networks.

### Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

• Computer Science
ICML
• 2015
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
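The normalization step the paper introduces can be sketched for a single feature in plain Python: normalize the activations over the mini-batch, then scale and shift with the learnable parameters gamma and beta (taken as 1 and 0 here). This is a sketch of the forward pass only, with names of my choosing.

```python
import math

# Batch-normalization forward pass for one feature: subtract the
# batch mean, divide by the batch standard deviation (with a small
# epsilon for stability), then apply the learnable scale and shift.

def batch_norm(batch, eps=1e-5, gamma=1.0, beta=0.0):
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta
            for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
# The normalized batch has (near-)zero mean and unit variance.
```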

### ImageNet Large Scale Visual Recognition Challenge

• Computer Science
International Journal of Computer Vision
• 2015
The creation of this benchmark dataset and the advances in object recognition that it has made possible are described, and state-of-the-art computer vision accuracy is compared with human accuracy.