CHAOS: a parallelization scheme for training convolutional neural networks on Intel Xeon Phi

  • Andre Viebke, Suejb Memeti, Sabri Pllana, Ajith Abraham
  • The Journal of Supercomputing
Deep learning is an important component of Big Data analytic tools and intelligent applications, such as self-driving cars, computer vision, speech recognition, or precision medicine. However, the training process is computationally intensive and often requires a large amount of time if performed sequentially. Modern parallel computing systems provide the capability to reduce the required training time of deep neural networks. In this paper, we present our parallelization scheme for training… 

DAPP: Accelerating Training of DNN

  • S. Sapna, N. S. Sreenivasalu, K. Paul
  • Computer Science
    2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)
  • 2018
This paper presents DAPP (Accelerating Training of DNN using a Ping-Pong approach), an acceleration technique that reduces training time by using distributed local memory and adapts to multi-core architectures.

Demystifying Parallel and Distributed Deep Learning

The problem of parallelization in DNNs is described from a theoretical perspective, followed by approaches for its parallelization, and potential directions for parallelism in deep learning are extrapolated.

Scaling Analysis of Specialized Tensor Processing Architectures for Deep Learning Models

The results give a precise estimate of the higher throughput of tensor processing architectures (TPAs) such as Google TPUv2 compared with GPUs for large numbers of computations, under conditions of low overhead and high utilization of TPU units achieved through large image and batch sizes.

Iteration Time Prediction for CNN in Multi-GPU Platform: Modeling and Analysis

This paper introduces a framework for analyzing the training time of convolutional neural networks (CNNs) on multi-GPU platforms; it decomposes the model and obtains accurate predictions without long-term training or complex data collection.

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis.

The problem of parallelization in DNNs is described from a theoretical perspective, followed by approaches for its parallelization, and potential directions for parallelism in deep learning are extrapolated.

A systematic literature review on hardware implementation of artificial intelligence algorithms

This work presents a systematic literature review of the hardware accelerators available for AI and ML tools, covering the use of FPGAs, GPUs and ASICs to accelerate computationally intensive tasks.

Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study

This work explores the use of high performance computing with distributed training to address the challenges of training BNNs at scale and demonstrates that network pruning can speed up inference without accuracy loss.


The reader will find the methodological foundations of convolutional neural networks, a description of a data set for building such models, an example of constructing a convolutional neural network model for classifying dermatoscopic images using the TensorFlow and Keras libraries in Python, and recommendations on how to present the results of building convolutional neural networks.

Big Data Analysis and Prediction System Based on Improved Convolutional Neural Network

It is proven that the convolutional neural network has faster training speed and higher accuracy, and that this network structure can effectively improve both the training speed and the accuracy of the network.

Programming Languages for Data-Intensive HPC Applications: a Systematic Literature Review

The results indicate that, for instance, the majority of the used HPC languages in the context of Big Data are text-based general-purpose programming languages and target the end-user community.



Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor

A many-core algorithm, based on a parallel method, is used on Intel Xeon Phi many-core systems to speed up the unsupervised training of Sparse Autoencoders and Restricted Boltzmann Machines; the results suggest that the Intel Xeon Phi can offer an efficient but more general-purpose way to parallelize deep learning algorithms compared to GPUs.

Accelerating Large-Scale Convolutional Neural Networks with Parallel Graphics Multiprocessors

This work adapts the inherent multi-level parallelism of CNNs to Nvidia's CUDA GPU architecture to accelerate training by two orders of magnitude, making it possible to apply CNN architectures to pattern recognition tasks on datasets with high-resolution natural images.

Accelerating pattern matching in neuromorphic text recognition system using Intel Xeon Phi coprocessor

From a scalability standpoint on a High Performance Computing (HPC) platform, it is shown that efficient workload partitioning and resource management can double the performance of this many-core architecture for neuromorphic applications.

A snapshot of image pre-processing for convolutional neural networks: case study of MNIST

This paper analyzes the impact of different preprocessing techniques on the performance of three CNNs (LeNet, Network3 and DropConnect) and their ensembles, and demonstrates that data-preprocessing techniques, such as the combination of elastic deformation and rotation, together with ensembles, have a high potential to further improve the state-of-the-art accuracy in MNIST classification.

High Performance Convolutional Neural Networks for Document Processing

Three novel approaches to speeding up CNNs are presented: a) unrolling convolution, b) using BLAS (basic linear algebra subroutines), and c) using GPUs (graphic processing units).

Benchmarking State-of-the-Art Deep Learning Software Tools

This paper presents an attempt to benchmark several state-of-the-art GPU-accelerated deep learning software tools, including Caffe, CNTK, TensorFlow, and Torch, and focuses on evaluating the running time performance of these tools with three popular types of neural networks on two representative CPU platforms and three representative GPU platforms.

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

High-Performance Neural Networks for Visual Object Classification

We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a

Comparative Study of Deep Learning Software Frameworks

A comparative study of five deep learning frameworks, namely Caffe, Neon, TensorFlow, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed finds that Theano and Torch are the most easily extensible frameworks.

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.