The Potential of the Intel (R) Xeon Phi for Supervised Deep Learning

@article{Viebke2015ThePO,
  title={The Potential of the Intel (R) Xeon Phi for Supervised Deep Learning},
  author={Andre Viebke and Sabri Pllana},
  journal={2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems},
  year={2015},
  pages={758-765}
}
  • Published 30 June 2015
  • Computer Science
Supervised learning of Convolutional Neural Networks (CNNs), also known as supervised Deep Learning, is a computationally demanding process. To find the most suitable parameters of a network for a given application, numerous training sessions are required. Therefore, reducing the training time per session is essential to fully utilize CNNs in practice. While numerous research groups have addressed the training of CNNs using GPUs, so far not much attention has been paid to the Intel Xeon Phi…
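The abstract's point about numerous training sessions can be made concrete: hyperparameter search cost is multiplicative in the grid sizes, so shortening each session shortens the whole search proportionally. A minimal sketch, with an illustrative stand-in for a training session (the function and parameter names are hypothetical, not from the paper):

```python
import itertools

# Illustrative hyperparameter grid (values are hypothetical).
learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64, 128]
momenta = [0.0, 0.9]

def train_cnn(lr, batch, momentum):
    # Stand-in for one full training session; a real session would
    # train a network to convergence and return its validation score.
    return {"lr": lr, "batch": batch, "momentum": momentum}

# One full training session per grid point: 3 * 3 * 2 = 18 runs,
# even for this tiny grid.
sessions = [train_cnn(*cfg) for cfg in
            itertools.product(learning_rates, batch_sizes, momenta)]
print(len(sessions))  # → 18
```

Halving the time per session halves the cost of all 18 runs, which is why per-session speedups on accelerators such as the Xeon Phi matter in practice.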
Heterogeneous acceleration for CNN training with many integrated core
TLDR
This paper proposes to accelerate the computation of gradients in the convolutional layer with a CPU+MIC heterogeneous computing technique; evaluating the time cost of computing the gradients of all layers in the Caffe framework, it finds that the convolutional layer dominates the overall computational overhead.
Deep Convolutional Network evaluation on the Intel Xeon Phi: Where Subword Parallelism meets Many-Core
TLDR
This thesis presents the evaluation of a novel ConvNet for road speed sign detection on a breakthrough 57-core Intel Xeon Phi processor with 512-bit vector support, and demonstrates that the parallelism inherent in the ConvNet algorithm can be effectively exploited by the 512-bit vector ISA and by utilizing the many-core paradigm.
Benchmarking State-of-the-Art Deep Learning Software Tools
TLDR
This paper presents an attempt to benchmark several state-of-the-art GPU-accelerated deep learning software tools, including Caffe, CNTK, TensorFlow, and Torch, and focuses on evaluating the running time performance of these tools with three popular types of neural networks on two representative CPU platforms and three representative GPU platforms.
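The core of running-time benchmarks like this is a careful timing harness. A minimal sketch of the general pattern (the workload function is a placeholder, not taken from any of the tools named above):

```python
import time

def benchmark(fn, warmup=2, repeats=5):
    """Time fn(), returning the best of several repeats in seconds."""
    for _ in range(warmup):
        fn()                         # warm caches before measuring
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)                # best-of-N reduces scheduler noise

def workload():
    # Hypothetical stand-in for one training iteration.
    sum(i * i for i in range(100_000))

best = benchmark(workload)
print(best >= 0.0)  # → True
```

Real framework benchmarks additionally pin threads, synchronize the GPU before stopping the clock, and report per-iteration or per-epoch time rather than a single run.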
Parallel Computing in DNNs Using CPU and MIC
TLDR
This paper speeds up the training of DNNs applied to automatic speech recognition on a CPU+MIC architecture, using several optimization methods for I/O and computation; results show that the optimized algorithm achieves about a 20x speedup over the original sequential algorithm running on a single CPU core.
Scalable training of 3D convolutional networks on multi- and many-cores
TLDR
This novel parallel algorithm decomposes training into a set of tasks, most of which are convolutions or FFTs; it can be either faster or slower than certain GPU implementations depending on specifics of the network architecture, kernel sizes, and the density and size of the output patch.
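The reason convolutions can be cast as FFT tasks is the convolution theorem: convolution in the spatial domain equals pointwise multiplication in the frequency domain. A small illustrative sketch (1-D for clarity; the cited work operates on 3-D volumes):

```python
import numpy as np

def conv_direct(x, k):
    # Full linear convolution by definition, O(n*m).
    n, m = len(x), len(k)
    out = np.zeros(n + m - 1)
    for i in range(n):
        for j in range(m):
            out[i + j] += x[i] * k[j]
    return out

def conv_fft(x, k):
    # Same result via zero-padded FFTs, O((n+m) log(n+m)).
    size = len(x) + len(k) - 1
    return np.real(np.fft.ifft(np.fft.fft(x, size) * np.fft.fft(k, size)))

x = np.random.default_rng(0).random(64)
k = np.random.default_rng(1).random(9)
print(np.allclose(conv_direct(x, k), conv_fft(x, k)))  # → True
```

Which route wins depends on kernel size: the FFT path amortizes well for large kernels and outputs, while direct convolution is competitive for small ones, matching the trade-off described above.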
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
  • S. Shi, Xiaowen Chu
  • Computer Science
  • 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech)
  • 2018
TLDR
This study evaluates the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments and identifies bottlenecks and overheads which could be further optimized.
Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters
TLDR
An in-depth performance characterization of state-of-the-art DNNs such as ResNet(s) and Inception-v3/v4 on multiple CPU architectures including Intel Xeon Broadwell, three variants of the Intel Xeon Skylake, AMD EPYC, and NVIDIA GPUs like K80, P100, and V100 is provided.
Accelerating Deep Learning with a Parallel Mechanism Using CPU + MIC
TLDR
This paper speeds up the training of DNNs applied to automatic speech recognition on a heterogeneous (CPU + MIC) architecture; it applies asynchronous methods for I/O and communication operations and proposes an adaptive load-balancing method.
ZNN -- A Fast and Scalable Algorithm for Training 3D Convolutional Networks on Multi-core and Many-Core Shared Memory Machines
TLDR
This work proposes a novel parallel algorithm based on decomposition into a set of tasks, most of which are convolutions or FFTs, which can attain speedup roughly equal to the number of physical cores within the PRAM model of parallel computation.
Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for High-Performance Deep Learning on Frontera
TLDR
The potential of Frontera for training state-of-the-art Deep Learning models at scale is explored and insights into process per node and batch size configurations for TensorFlow as well as for PyTorch and MXNet are provided.
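The multi-node scaling these frameworks achieve rests on the data-parallel pattern: each rank computes a gradient on its own shard, then the gradients are averaged with an allreduce so every rank applies the same update. A toy sketch with simulated ranks (no real MPI; the helper name is illustrative):

```python
import numpy as np

def allreduce_mean(grads):
    """Stand-in for an MPI allreduce: average one gradient per rank."""
    return np.mean(np.stack(grads), axis=0)

rng = np.random.default_rng(0)
num_ranks = 4
# One local gradient per simulated worker, each from its own data shard.
grads = [rng.normal(size=4) for _ in range(num_ranks)]

avg = allreduce_mean(grads)
# Every rank applies the same averaged update, keeping replicas in sync.
print(avg.shape)  # → (4,)
```

In practice the allreduce is where MPI libraries such as MVAPICH2 earn their keep, since its latency and bandwidth at large rank counts bound the achievable scaling.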
